Billion Node Analysis Using Spark, Scala, and R

Overview

This is an analysis of the Billion Triples Challenge dataset using a combination of Spark, Scala and R. Information about the Billion Triples Challenge can be found here: http://km.aifb.kit.edu/projects/btc-2010/.
To summarize, the BTC researchers started with a seed set of URIs and collected data by following the links from the seed pages to gather connection information. In turn, the links found in subsequent pages were also followed. The project sponsors and researchers hope to use the information for application improvement and assist users in performing tasks. The BTC is part of a larger research effort known as the Semantic Web.
The data used for this particular analysis is hosted on Amazon’s S3 servers by the University of Washington and can be found here: http://uw-cse-344-oregon.aws.amazon.com.s3.amazonaws.com/. The data are in the format of (subject) (predicate) (object). For this analysis, focus was on the (subject) part of the data and the number of outgoing connections each had.

Approach

The data was stored on S3 as a series of 2GB text files. I chose to use Spark to process and summarize the data. Why? Spark represents a significant performance improvement over some MapReduce implementations but is flexible enough to leverage existing file stores such as HDFS.
I chose to use Spark with Scala, the language that Spark was written in. Scala can be thought of as an improved, streamlined version of Java. It is compatible with Java: Java libraries can be called from Scala, and Scala programs can be used by Java applications if the right libraries are included. Scala is less rigid in its type definitions and requires less boilerplate code that Java has become (in)famous for. More information on Scala can be found here: https://www.scala-lang.org/. I hope to do more with Scala as time permits.

My first attempt at the analysis used a three node cluster on Amazon’s EMR (Elastic MapReduce) cloud service. However, this proved to be insufficient for the task. Just running simple schema operations took hours. I deleted this cluster and restarted the analysis with a twelve server cluster. The total run time was approximately two hours with this setup.

After the initial processing and summarization, I exported the resulting dataset from the hadoop file system to the root node’s file system. From here, the “scp” (secure copy) command was used to bring the dataset down for further analysis. I chose to use R for this. Why R? Dataframes are integral to R and are easily manipulated in this language. Also, the data visualization libraries such as ggplot are expressive.

Analysis

Top 10 sites by connection

Here are the top 10 sites and the number of outbound connections each had.
Notice the rapid drop off of connections after the first couple.

Top 10 Outbound Connections
Subject	Outbound
http://www.proteinontology.info/po.owl#A	1006296
http://openean.kaufkauf.net/id/	764528
http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy	589809
http://purl.uniprot.org/citations/15685292	353625
http://sw.cyc.com/CycAnnotations_v1#externalID	351764
http://xmlns.com/foaf/0.1/Document	323828
http://sw.cyc.com/CycAnnotations_v1#label	263823
http://purl.org/dcmitype/Text	263141
http://sw.opencyc.org/concept	255195
http://purl.uniprot.org/citations/16973872	248626

Exponential Distribution

If the number of outbound connections were truly random, I would expect them to follow an exponential distribution as the histogram below shows.

Histogram

The histogram below looks at the summary data. It groups and counts the nodes in the BTC dataset by outbound connection. The data appears to create almost a single bar close to zero. This indicates that the vast majority of sites have very few outbound connections.

Here’s a visualization with a little more detail. I have used a logarithmic scale on the y axis, and have limited the number of outbound connections to a range of zero to 10,000.

Conclusion

If the connections in the data were random, the number of connections should be expected to follow an exponential distribution: the number of outbound connections a site has should become exponentially less common as the number increases. While there are fewer nodes with higher connections, the connections do not follow an exponential distribution pattern. The plots show that the vast majority of the sites in the BTC data contain few connections. The drop off is much faster than an exponential distribution. In looking at the log scale graph, it is apparent that most sites have fewer than 2500 outbound connections.