Roughly one third of Sanford PhD students have run into serious technical issues when running statistical analyses and computations on large datasets. In this memo we briefly summarize the results of a survey we conducted to (1) identify the specific issues these students have faced and (2) discover which resources they have used to try to solve them.

The opinions here largely reflect students’ direct experience with computational limitations and come primarily from students in their 2nd–5th years. Notably, though, many students outside this survey have expressed similar concerns about their future analytical needs (including first- and second-year students who have not yet begun working with data). We received responses from 12 of the 31 Sanford PhD students, and we assume that students who did not respond do not have significant computing needs (or have not yet run into them).

Problems with large-scale, computationally intensive data analysis

With methodological innovations advancing across the disciplines, many students now run statistical models that require far more computational power than the basic models and tools most commonly run in statistical packages.

Additionally, students are accessing and analyzing data that is quite large and complex, including detailed longitudinal health and education data and comprehensive state and federal voter records. Some students deal with datasets as large as 50 GB, with millions of rows and thousands of variables.

However, students are often unable to run these computationally and data-intensive analyses, which impedes their research: they are forced to limit the number and type of analyses they perform and are unable to access useful data. Even when writing efficient code that runs parallel processes across multiple CPU cores, one student estimates, “I would say overall that my productivity could have been double what it has been during graduate school with better computational power.” The majority of students responding to the survey have faced at least moderate challenges with computational performance.
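
To give a sense of the multi-core work students describe, the sketch below spreads a simple bootstrap across all available CPU cores using Python’s standard multiprocessing module. It is an illustration only, not any respondent’s actual code, and the dataset and statistic are hypothetical stand-ins.

    from multiprocessing import Pool

    import numpy as np

    # Stand-in dataset; in practice this would be one of the large files
    # described above, loaded from disk.
    DATA = np.random.default_rng(0).normal(size=1_000_000)


    def bootstrap_mean(seed):
        """Draw one bootstrap resample of DATA and return its mean."""
        rng = np.random.default_rng(seed)
        return rng.choice(DATA, size=DATA.size, replace=True).mean()


    if __name__ == "__main__":
        # Pool() defaults to one worker process per available CPU core.
        with Pool() as pool:
            estimates = pool.map(bootstrap_mean, range(1000))  # 1,000 replicates
        print(np.percentile(estimates, [2.5, 97.5]))  # bootstrap 95% interval

Even code structured this way is ultimately limited by the cores and memory of a single laptop or desktop, which is precisely the constraint students report running into.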

These computational issues universally affect students’ own research and often arise when students do work for a faculty member as a research assistant. In one case, a student’s adviser was able to use grant money to purchase their own high-performance computer, but this is rare.

Potential solutions and their downsides

Virtual private servers

There have been many technological innovations to address the performance issues inherent in big data analysis, and it is now possible to distribute large-scale tasks across clusters of cloud-based computers at very low cost. For instance, students can (and do) create virtual private servers with Amazon EC2 or DigitalOcean for pennies per hour and run their code in open-source languages like R and Python (a rough sketch of this workflow appears below). However, this solution does not work for all students because of two general limitations.
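
As a rough, hedged sketch of that workflow, the snippet below launches a short-lived Amazon EC2 server from Python using the boto3 library; the machine image, instance type, and key-pair name are placeholder assumptions, and real use requires configured AWS credentials.

    # Rough sketch of the cloud workflow described above; all identifiers
    # here (AMI ID, instance type, key name) are placeholder assumptions.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder image with R/Python preinstalled
        InstanceType="c5.2xlarge",        # 8 vCPUs, billed by the hour
        MinCount=1,
        MaxCount=1,
        KeyName="research-key",           # placeholder SSH key pair
    )

    instance_id = response["Instances"][0]["InstanceId"]
    print(f"Launched {instance_id}; run the analysis over SSH, then terminate.")
    # When the job finishes: ec2.terminate_instances(InstanceIds=[instance_id])

Whether this route is even workable, however, depends on the licensing and data-restriction issues discussed next.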

First, most Sanford PhD students use Stata for their data analysis. Because Stata is proprietary software that requires a license for each installed instance, it is difficult to install across a cluster of virtual Linux machines. Additionally, the Stata Corporation offers multiple versions of its software, with each version able to handle different kinds of models and different sizes of data. Because of the cost, many students cannot and do not purchase licenses for the more powerful (and more expensive) versions of Stata.

Second, and perhaps the most serious obstacle to using high-performance distributed cloud-based solutions, many students use data that contains sensitive or personally identifiable information and is consequently highly restricted. Of the students who responded to the survey and have faced performance issues, only three have been able to store their data on their personal machines without restrictions. All others are required to keep their data on highly secure storage (sometimes an air-gapped computer in a locked room) or on secure server space provided by the Sanford School. Privacy restrictions preclude students from uploading this data to, or accessing it from, clusters of virtual machines.

SSRI High-Performance Compute Servers

The recommended solution for computationally intensive analysis on private data is SSRI’s High-Performance Compute Servers, a cluster of powerful (and, importantly, secure) servers that students can access through a VPN. Many of the students facing performance issues have used this service, but not a majority of them.

Many students who do not use SSRI’s servers avoid them because the servers are not powerful enough for their data. Half of the students who responded have run into technical limitations on SSRI’s ostensibly high-performance cluster.

As more students across the social science disciplines at Duke face similar issues with larger data and more intensive computation, the additional load on SSRI’s servers is showing. Students complain that they often cannot log in because too many users are already connected, and that the machines regularly (and inexplicably) freeze or shut down. The servers also run older software that is updated infrequently, and there are strict limits on which external packages or modules can be installed. In short, students feel that the SSRI High-Performance Compute Servers are too busy, too slow, and too old.

Other solutions

Because of these issues with SSRI’s servers, students have generally found other (and often less acceptable) solutions, including:

  • Not running specific analyses
  • Not using certain data
  • Producing less research output
  • Borrowing computer time in other departments on campus
  • Waiting for computations to complete, sometimes up to 36 hours