Aster Graphs with R and toaster

How Graph Data is Stored in R and Aster

Graph Data in Aster

Teradata Aster is a relational data store that uses relational tables for all data structures. By convention, every graph should be represented using two tables:

the vertices table must contain a unique vertex key and, optionally, columns with vertex attributes
the edges table must contain source and target columns (referencing vertex keys) and, optionally, columns with edge attributes
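As a small illustration of this two-table convention, the sketch below mirrors it with plain R data frames. The rows are made up; only the column names (officer, officer1, officer2, offense_count, weight) are borrowed from the Dallas Police examples later in this document. The two checks verify that vertex keys are unique and that every edge endpoint references an existing vertex key.

```r
# Toy vertices table: one row per vertex, a unique key plus optional attributes
vertices <- data.frame(
  officer = c("A1", "B2", "C3", "D4"),
  offense_count = c(12L, 7L, 3L, 9L),
  stringsAsFactors = FALSE
)

# Toy edges table: source and target columns reference vertex keys
edges <- data.frame(
  officer1 = c("A1", "A1", "B2"),
  officer2 = c("B2", "C3", "C3"),
  weight = c(0.5, 0.2, 0.9),
  stringsAsFactors = FALSE
)

# The vertex key must be unique
keysAreUnique <- anyDuplicated(vertices$officer) == 0

# Every source and target value must exist among the vertex keys
edgesAreValid <- all(edges$officer1 %in% vertices$officer) &&
  all(edges$officer2 %in% vertices$officer)
```

A vertex such as D4 that appears in no edge is still a valid part of the graph, which is one reason to keep vertices in their own table.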

In turn, every SQL-GR analytical function expects at minimum the following specification:

the vertex table, view, or SQL query with the vertex key
the edges table, view, or SQL query with the source and target keys
whether the graph is directed or not
whether the edges have a weight attribute

toaster Graph Metadata Object

To simplify and streamline graph operations in Aster, toaster provides a specialized object, toagraph, that holds metadata about an Aster graph. The graph metadata consists of the following:

vertices: the name of a table or view, or a SQL query, representing the vertices
edges: the name of a table or view, or a SQL query, representing the edges
directed: a logical flag indicating whether the graph is directed or not
key: the name of a unique column (key) in the vertex table or query
source: the name of the source column in the edge table or query
target: the name of the target column in the edge table or query
vertexAttrnames: the names of vertex attributes (optional)
edgeAttrnames: the names of edge attributes (optional)
vertexWhere: a SQL expression to filter vertices inside a WHERE clause
edgeWhere: a SQL expression to filter edges inside a WHERE clause

The following two graphs, defined on Open Dallas Police data, are used in all examples that follow:

# undirected graph of police officers
policeGraphUn = toaGraph(vertices = "dallaspolice_officer_vertices", 
                         edges = "dallaspolice_officer_edges_un", 
                         directed = FALSE,
                         key = "officer", source = "officer1", target = "officer2", 
                         vertexAttrnames = c("offense_count"),
                         edgeAttrnames = c("weight"))

# directed graph of police officers
policeGraphDi = toaGraph(vertices = "dallaspolice_officer_vertices", 
                         edges = "dallaspolice_officer_edges_di", 
                         directed = TRUE,
                         key = "officer", source = "officer1", target = "officer2", 
                         vertexAttrnames = c("offense_count"),
                         edgeAttrnames = c("weight"))

Note that toagraph stores only metadata and requires no ODBC connectivity.

R Network Object and toaster

In toaster we always use de facto R standards to hold the results of Aster computations in R. Examples of this approach include the kmeans, lm, and DocumentTermMatrix objects returned by the corresponding computation functions. The same applies to graphs, where the choice was between the igraph and network objects from the packages of the same names. The choice of the network object is not obvious, and you can find details here. Moreover, with the intergraph package the two can always be converted into each other (more examples later).

Function computeGraph transfers a graph from the Aster database into R memory as a network object, where it can be visualized with ggplot2, igraph, and other graph visualization functions. The simplest invocation of computeGraph may look like this:

netPoliceUn = computeGraph(conn, policeGraphUn)

GGally::ggnet2(netPoliceUn, node.label="vertex.names", node.size="offense_count", 
               legend.position="none", label.size=3)

To finish, let’s visualize the same graph with igraph, converting the network object with asIgraph from the intergraph package:

igraphPoliceUn = asIgraph(netPoliceUn)
plot(igraphPoliceUn)

Graph Metrics

With toaster most of the graph functions that compute various centrality indices and other metrics are available with just two functions: computeGraphHistogram and computeGraphMetric. Currently, the following graph metrics are supported:

Vertex degree (all, in, out)
Local clustering coefficient (directed and undirected)
Shortest path between pairs of vertices
PageRank
Betweenness centrality
Eigenvector centrality
Closeness centrality
K-degree
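To make two of the listed metrics concrete, here is a minimal base R sketch that computes vertex degree and the local clustering coefficient on a tiny, made-up undirected graph. It follows the textbook definitions and is not meant to reproduce how Aster or toaster compute these values; in particular, assigning 0 to vertices with fewer than two neighbors is a convention chosen for this sketch.

```r
# Toy undirected graph as an edge list (made-up data)
edges <- data.frame(
  from = c("a", "a", "a", "b", "c"),
  to   = c("b", "c", "d", "c", "d"),
  stringsAsFactors = FALSE
)
vertexKeys <- sort(unique(c(edges$from, edges$to)))

# Symmetric adjacency matrix indexed by vertex key
adj <- matrix(0, length(vertexKeys), length(vertexKeys),
              dimnames = list(vertexKeys, vertexKeys))
adj[cbind(edges$from, edges$to)] <- 1
adj[cbind(edges$to, edges$from)] <- 1

# Vertex degree: number of edges incident to each vertex
degree <- rowSums(adj)

# Local clustering coefficient: share of neighbor pairs that are themselves connected
localClustering <- sapply(vertexKeys, function(v) {
  nb <- vertexKeys[adj[v, ] == 1]
  k <- length(nb)
  if (k < 2) return(0)            # convention used here for degree 0 or 1
  links <- sum(adj[nb, nb]) / 2   # each neighbor-to-neighbor edge is counted twice
  links / (k * (k - 1) / 2)
})
```

In this toy graph vertices b and d each have two neighbors that are connected to each other, so their coefficient is 1, while a and c have three neighbors with two of the three possible links present.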

Graph Histograms

A compact and repeatable way to visualize and profile various graph metrics lets toaster complement the power of Aster’s distributed big data database with a simple workflow that delivers effective results. The computeGraphHistogram function computes the distribution of a given metric over the graph vertices and returns a result for use with createHistogram.

Vertex degree distribution:

hdegreePolice = computeGraphHistogram(conn, policeGraphUn, type='degree', numbins=36) 
createHistogram(hdegreePolice, 
                title = "Dallas Police Graph Degree Distribution", 
                xlab='Degree', ylab='Count',
                themeExtra = theme(axis.text.x = element_text(angle = 0)))
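The binning idea behind such a histogram can be sketched locally in base R. The helper binMetric below is hypothetical and is not part of toaster; its numbins, startvalue, and endvalue arguments only mimic the names used with computeGraphHistogram in this document, and the equal-width, left-closed bins are an assumption of this sketch rather than a statement about the Aster implementation.

```r
# Made-up degree values for ten vertices
degreeValues <- c(1, 1, 2, 2, 2, 3, 4, 4, 7, 9)

# Hypothetical helper: equal-width, left-closed bins between startvalue and endvalue;
# the last bin also includes endvalue itself
binMetric <- function(x, numbins, startvalue = min(x), endvalue = max(x) + 1) {
  breaks <- seq(startvalue, endvalue, length.out = numbins + 1)
  bins <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)
  data.frame(bin_start = head(breaks, -1),
             bin_end = tail(breaks, -1),
             count = as.vector(table(bins)))
}

degreeHist <- binMetric(degreeValues, numbins = 3, startvalue = 0, endvalue = 9)
print(degreeHist)
```

The point of computing the distribution in the database is that only one row per bin travels to R, however many vertices the graph has.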

Distribution of the distances between pairs of vertices:

hshortestpathPolice = computeGraphHistogram(conn, policeGraphUn, type='shortestpath',
                                numbins = 10)
createHistogram(hshortestpathPolice, 
                title = "Dallas Police Shortest Path Distribution", 
                xlab = "Distance", ylab = "Count",
                themeExtra = theme(axis.text.x = element_text(angle = 0)))

Distribution of the graph local clustering coefficients:

# Clustering coefficient distribution
hclusterPolice = computeGraphHistogram(conn, policeGraphUn, type='clustering', 
                                       numbins = 100)
createHistogram(hclusterPolice, 
                title = "Dallas Police Clustering Coefficients Distribution", 
                xlab = "Clustering Coefficient", ylab = "Count",
                breaks=c(seq(0.0, 0.99, 0.1),0.99),
                themeExtra = theme(axis.text.x = element_text(angle = 0)))

Centrality Measure Distributions

PageRank

hpagerankPolice = computeGraphHistogram(conn, policeGraphDi, type='pagerank', 
                                        numbins = 100)
createHistogram(hpagerankPolice, title="Dallas Police PageRank Distribution", 
                xlab="PageRank", ylab = "Count",
                breaks=seq(0.0, 0.00140, 0.0001),
                themeExtra = theme(axis.text.x = element_text(angle = 0)))

Betweenness

hbetweenPolice = computeGraphHistogram(conn, policeGraphUn, type='betweenness', 
                                       numbins = 100, startvalue=0, endvalue=5000)
createHistogram(hbetweenPolice, title = "Dallas Police Betweenness Distribution", 
                xlab = "Betweenness", ylab = "Count",
                breaks=seq(0,4450,300),
                themeExtra = theme(axis.text.x = element_text(angle = 0)))

Closeness

hclosenessPolice = computeGraphHistogram(conn, policeGraphUn, type='closeness', 
                                         numbins=100)
createHistogram(hclosenessPolice, title = "Dallas Police Closeness Distribution", 
                xlab = "Closeness", ylab = "Count",
                breaks=seq(0,1,0.05),
                themeExtra = theme(axis.text.x = element_text(angle = 0)))

K-degree

hkdegreePolice = computeGraphHistogram(conn, policeGraphUn, type='k-degree', 
                                       numbins = 100)
createHistogram(hkdegreePolice, title = "Dallas Police K-degree Distribution", 
                xlab = "K-degree", ylab = "Count",
                breaks=seq(0,2200,200),
                themeExtra = theme(axis.text.x = element_text(angle = 0)))

Top Vertices by Graph Metrics

Analyzing graph distributions is an effective first step, but it lacks the precision needed to zero in on concrete graph vertices. Vertices with high values of certain metrics, such as PageRank or betweenness centrality, often carry critical or significant value in the underlying network. Thus, determining the vertices with top values for a given metric is the natural next step in graph analysis.

For this purpose toaster offers a second function, computeGraphMetric, that calculates the top vertices for every metric it supports. For example, to determine the top 30 vertices by a metric such as degree or betweenness centrality (a measure of the influence a node has over the spread of information through the network), a single call is enough:

Top degree vertices:

topDegree = computeGraphMetric(conn, policeGraphUn, type="degree", top=30)

createTopMetricPlot(topDegree, 'degree', ylab='Degree', title='Top 30 Officers (Vertex Degree)')
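Conceptually, a top-N query is a sort followed by a cut-off. The base R sketch below shows that idea on made-up values. The helper topVertices is hypothetical, and the result layout (a key column plus a column named after the metric) is only inferred from how topDegree$key and the "degree" column are used elsewhere in this document.

```r
# Made-up metric values per vertex
metric <- data.frame(
  key = c("A1", "B2", "C3", "D4", "E5"),
  degree = c(4, 9, 2, 7, 9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows with the `top` largest values of column `type`,
# with ties broken by vertex key so the result is deterministic
topVertices <- function(df, type, top) {
  ranked <- df[order(-df[[type]], df$key), ]
  head(ranked, top)
}

top3 <- topVertices(metric, "degree", top = 3)
print(top3)
```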

Other metrics, such as the centrality measures betweenness and PageRank, work the same way:

topbetweenness = computeGraphMetric(conn, policeGraphUn, type='betweenness', top=30)

createTopMetricPlot(topbetweenness, 'betweenness', ylab='Betweenness', 
                    title='Top 30 Officers (Betweenness)')

# Top PageRank Vertices
topPagerankPolice = computeGraphMetric(conn, policeGraphDi, type='pagerank', top=30)
createTopMetricPlot(topPagerankPolice, 'pagerank', ylab='PageRank', 
                    title='Top 30 Officers (PageRank)')
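For readers who want intuition for what a PageRank value represents, the following base R sketch runs the textbook power iteration on a tiny, made-up directed graph. It is an illustration under stated assumptions (damping factor 0.85, ranks normalized to sum to 1, every vertex has at least one outgoing edge) and does not claim to match the Aster implementation or its parameters.

```r
# Toy directed graph (made-up data): each row is an edge from -> to
edges <- data.frame(
  from = c("a", "a", "b", "c", "d"),
  to   = c("b", "c", "c", "a", "c"),
  stringsAsFactors = FALSE
)
vertexKeys <- sort(unique(c(edges$from, edges$to)))
n <- length(vertexKeys)

# Out-degree per vertex (all positive in this toy graph)
outdeg <- tabulate(match(edges$from, vertexKeys), nbins = n)
names(outdeg) <- vertexKeys

# Column-stochastic transition matrix: M[to, from] = 1 / outdegree(from)
M <- matrix(0, n, n, dimnames = list(vertexKeys, vertexKeys))
M[cbind(edges$to, edges$from)] <- 1 / outdeg[edges$from]

# Power iteration with damping
damping <- 0.85
rank <- rep(1 / n, n)
for (i in 1:100) {
  rank <- (1 - damping) / n + damping * as.vector(M %*% rank)
}
names(rank) <- vertexKeys
print(round(rank, 4))
```

Vertex c collects links from three other vertices and ends up with the highest rank, while d, which nobody links to, keeps only the baseline share (1 - damping) / n.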

Aster as a Graph Query Engine

We saw earlier how to load a whole graph from Aster into R memory. Visualizing a whole graph is not particularly useful with big data and is sometimes impossible (due to memory constraints). For this reason toaster offers several ways to control which and how much graph data to load from Aster into network objects. Effectively, combined with R visualization functions, these mechanisms become a graph query interface to the Aster graph data store.

Filtering vertices and edges

SQL WHERE filters can be applied to graph vertices, graph edges, or both when defining the toagraph metadata object:

policeGraphUnFilter = toaGraph(vertices = "dallaspolice_officer_vertices", 
                         edges = "dallaspolice_officer_edges_un", 
                         directed = FALSE,
                         key = "officer", source = "officer1", target = "officer2", 
                         vertexAttrnames = c("offense_count"),
                         edgeAttrnames = c("weight"),
                         vertexWhere = "officer ~ '[A-Z].*'")

showGraph(conn, policeGraphUnFilter, 
          node.label="vertex.names", node.size="offense_count", 
          legend.position="none", label.size=3)

The same can be accomplished on a per-call basis without defining filters in the metadata: toaster’s graph compute functions always offer arguments that override the corresponding filters in toagraph. Note that the toagraph object here is the original policeGraphUn with no filters, and that both vertex and edge filters are used this time:

showGraph(conn, policeGraphUn, 
          vertexWhere = "officer ~ '[A-Z].*'", 
          edgeWhere = " weight > 0.10 ", 
          node.label="vertex.names",  node.size="offense_count", 
          legend.position="none", label.size=3)
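The override rule described above can be pictured with a short base R sketch. Everything in it is hypothetical: graphMeta is a plain list standing in for a toagraph object, vertexQuery is an illustrative helper, and the generated SQL is not the SQL that toaster actually sends to Aster. It only shows a filter passed in the call taking precedence over one stored in the metadata.

```r
# Hypothetical stand-in for graph metadata: a plain list, not a real toagraph object
graphMeta <- list(vertices = "dallaspolice_officer_vertices",
                  key = "officer",
                  vertexWhere = NULL)

# Hypothetical helper: a per-call filter wins over the filter stored in the metadata
vertexQuery <- function(meta, vertexWhere = NULL) {
  where <- if (!is.null(vertexWhere)) vertexWhere else meta$vertexWhere
  sql <- paste0("SELECT ", meta$key, " FROM ", meta$vertices)
  if (!is.null(where)) sql <- paste0(sql, " WHERE ", where)
  sql
}

noFilterSql <- vertexQuery(graphMeta)
perCallSql  <- vertexQuery(graphMeta, vertexWhere = "officer ~ '[A-Z].*'")
print(noFilterSql)
print(perCallSql)
```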

Subgraphs

Subgraphs can be defined by conditions on the vertex keys that use arbitrary logic and access data beyond the graph tables. computeGraph has an optional argument v that is either a list of vertex key values or a SQL query that returns such values when executed in Aster:

subGraphOfficerLetters = computeGraph(conn, policeGraphUn,  
                    v = "SELECT officer 
                           FROM dallaspolice_officer_vertices 
                          WHERE officer ~ '[A-Z].*'")
subGraphOfficerLetters %v% "color" = 
  substr(get.vertex.attribute(subGraphOfficerLetters, "vertex.names"), 1, 1)
GGally::ggnet2(subGraphOfficerLetters, node.label="vertex.names", label.size=3,
               node.size="offense_count", size.cut=TRUE, node.color="color",
               palette = "Set2", legend.position="none")

Subgraph using list of vertices:

subGraphOfficerM = computeGraph(conn, policeGraphUn,  
                    v=list('M175','M190','M203','M183','M202','M193','M201'))

GGally::ggnet2(subGraphOfficerM, node.label="vertex.names",  node.size="offense_count",
               node.color="vertex.names", size.cut=3, legend.position="none", label.size=3,
               palette = "Set3")
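What selecting a subgraph by a vertex list means can be shown on a local edge list. The sketch below assumes the induced-subgraph reading, in which an edge is kept only when both of its endpoints are in v; whether computeGraph applies exactly this rule is an assumption here, and the edge data, including the vertex X001, are made up.

```r
# Made-up edge list using the same column names as the police graph
edges <- data.frame(
  officer1 = c("M175", "M175", "M190", "X001"),
  officer2 = c("M190", "X001", "M203", "M203"),
  stringsAsFactors = FALSE
)

# Vertex keys that define the subgraph
v <- list("M175", "M190", "M203")

# Induced subgraph: keep edges whose source AND target are both in v
keep <- edges$officer1 %in% unlist(v) & edges$officer2 %in% unlist(v)
subEdges <- edges[keep, ]
print(subEdges)
```

The two edges that touch X001 are dropped because X001 is not in the list, even though their other endpoint is.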

Subgraph of the top degree nodes:

topDegreeNet = computeGraph(conn, policeGraphUn, v=as.list(as.character(topDegree$key)))

topDegreeNet %v% "degree" = 
  topDegree[match(get.vertex.attribute(topDegreeNet, "vertex.names"), topDegree$key), "degree"]

GGally::ggnet2(topDegreeNet, node.label="vertex.names", node.size="degree", 
               legend.position="none")

Ego-graphs

An ego graph (or neighborhood graph) requires a vertex v and an integer order greater than 0. The ego graph is then induced by the vertex v and all vertices reachable from it within a distance not greater than the order of the ego graph. In Aster, implementing an ego graph requires a combination of several graph functions and SQL. The toaster function computeEgoGraph does exactly that:

egoCenters = as.list(as.character(topPagerankPolice$key[1:3]))
print(egoCenters)
#> [[1]]
#> [1] "4682"
#> 
#> [[2]]
#> [1] "10459"
#> 
#> [[3]]
#> [1] "7970"

egoGraphsTopPagerank = computeEgoGraph(conn, policeGraphDi, order = 1, ego = egoCenters)

egoGraph = setVertexColor(egoGraphsTopPagerank[[1]], egoCenters[[1]])
GGally::ggnet2(egoGraph, node.label="vertex.names",  node.size="offense_count",
               legend.position="none", color="color")

egoGraph = setVertexColor(egoGraphsTopPagerank[[2]], egoCenters[[2]])
GGally::ggnet2(egoGraph, node.label="vertex.names",  node.size="offense_count",
               legend.position="none", color="color")

egoGraph = setVertexColor(egoGraphsTopPagerank[[3]], egoCenters[[3]])
GGally::ggnet2(egoGraph, node.label="vertex.names",  node.size="offense_count",
               legend.position="none", color="color")
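The ego graph definition above can be sketched locally with a breadth-first search in base R. The helper egoVertices is hypothetical and treats edges as undirected; how computeEgoGraph follows edge direction in a directed graph is not stated in this document. The helper setVertexColor used in the calls above is also not defined in this section, so the last line only illustrates the kind of per-vertex color attribute such a helper might produce, marking the ego center differently from its neighbors.

```r
# Toy edge list (made-up data), treated as undirected in this sketch
edges <- data.frame(
  from = c("a", "b", "c", "d", "a"),
  to   = c("b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: vertices within `order` hops of `ego` (breadth-first search)
egoVertices <- function(edges, ego, order) {
  reached <- ego
  frontier <- ego
  for (step in seq_len(order)) {
    # Neighbors of the current frontier in either direction
    nb <- c(edges$to[edges$from %in% frontier],
            edges$from[edges$to %in% frontier])
    frontier <- setdiff(nb, reached)
    if (length(frontier) == 0) break
    reached <- c(reached, frontier)
  }
  reached
}

# Ego graph of order 2 around vertex "a": reachable vertices and the edges among them
egoV <- egoVertices(edges, ego = "a", order = 2)
egoEdges <- edges[edges$from %in% egoV & edges$to %in% egoV, ]

# Illustrative color attribute: the ego center in red, everything else in grey
vertexColor <- ifelse(egoV == "a", "red", "grey")
```

With order = 2, vertex d is three hops from a and is therefore excluded, which also removes the c-d and d-e edges from the result.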