Data science: how is it different to statistics?

Recently, there has been much hand-wringing about the role of statistics in data science. There is a lot of fear that statistics is seen as irrelevant. In this and coming columns I’ll discuss both the threat and opportunity of data science: I believe that statistics is a crucial part of data science, but equally statistics departments are at grave risk of becoming irrelevant unless they teach the tools that people need. In this, my first column, I’ll discuss why I think data science isn’t just statistics, and highlight important parts of the process that have traditionally been out of bounds for statistics research.

What is there to data science apart from statistics? A lot! The diagram below illustrates what I think of as the major parts of the data science process.

You start by collecting data and questions, perform data analysis (using visualisation and models), then communicate the results. It’s rare to walk this process in one direction: often your analysis will reveal that you need new or different data, or when communicating your analysis to others you’ll discover a flaw in your model, or come up with new questions.

Statistics has a lot to say about collecting data: survey sampling and design of experiments (DoE) are well-established fields backed by decades of research. Statisticians, however, have little to say about collecting and refining questions. Good questions are crucial for good analysis, but there is little research in statistics about how to solicit and polish good questions, and it’s a skill rarely taught in core PhD curricula.

Once the data has been collected, it needs to be tidied (or normalised) into a form that’s amenable to analysis. Organising data into the right “shape” is essential for fluent data analysis: if it’s in the wrong shape you’ll spend the majority of your time fighting your tools, not questioning the data. I’ve been working on this problem for quite some time (culminating in the tidy data paper, http://vita.had.co.nz/papers/tidy-data.html), but I’m aware of little similar work by statisticians.
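As a rough illustration of what getting data into the right “shape” can mean, here is a minimal sketch in plain Python. The table, its column names and the `tidy` helper are all made up for this example, and it shows only one common reshaping (spreading column headers that are really values into rows), not a general method.

```python
# Hypothetical "wide" table: one row per subject, one column per treatment.
# The column names treatment_a and treatment_b are really values of a
# variable ("treatment"), which makes the table awkward to analyse.
wide = [
    {"subject": "s1", "treatment_a": 5, "treatment_b": 7},
    {"subject": "s2", "treatment_a": 3, "treatment_b": 4},
]


def tidy(rows, id_key, var_name, value_name):
    """Reshape wide rows into long rows: one observation per row."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the identifier is repeated on every output row
            long_rows.append(
                {id_key: row[id_key], var_name: key, value_name: value}
            )
    return long_rows


long_rows = tidy(wide, "subject", "treatment", "result")
for row in long_rows:
    print(row)
```

In the long form every row is a single observation, so grouping, filtering or plotting by treatment needs no special handling for each column.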

Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualisation and modelling. Visualisation and modelling are complementary. Visualisations surprise you, and can help refine vague questions. However, visualisations rely on human interpretation, so their ability to scale is fundamentally constrained. Models scale much better, and it’s usually possible to throw more computing at the problem. But models are constrained by their assumptions: a model cannot fundamentally surprise you. In any real analysis you must use both visualisations and models. But the vast majority of statistics research is on modelling; much less is on visualisation; and less still is on how to iterate between modelling and visualisation to get to a good place.
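One well-known way to see why a model alone cannot surprise you is Anscombe’s quartet, which is my choice of illustration here rather than something the argument above depends on. The sketch below fits a straight line to two of its four datasets using a small hand-written least squares helper (`fit_line` and `residuals_by_x` are names invented for this example). The fitted lines are nearly identical, yet the residuals, which stand in for what a plot would show at a glance, look quite different.

```python
# Two of the four datasets from Anscombe's quartet; they share the same x.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals_by_x(xs, ys):
    """Residuals from the fitted line, ordered by increasing x."""
    intercept, slope = fit_line(xs, ys)
    return [round(b - (intercept + slope * a), 2) for a, b in sorted(zip(xs, ys))]


fit1 = fit_line(x, y1)
fit2 = fit_line(x, y2)
print("fit 1 (intercept, slope):", fit1)
print("fit 2 (intercept, slope):", fit2)
print("residuals 1:", residuals_by_x(x, y1))
print("residuals 2:", residuals_by_x(x, y2))
```

The two fits agree closely, but the second set of residuals traces a smooth arc from negative to positive and back. The straight-line model never reports that it is the wrong model; a scatterplot makes the curve obvious.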

Finally, the end product of an analysis is not a model; it is rhetoric. An analysis is meaningless unless it convinces someone to take action. In business, this typically means convincing your senior management, who have little statistical expertise. In science, it typically means convincing reviewers. The presentation of statistical results is absolutely critical to their utility. Communication is not a mainstream thread of statistics research; if you attend the JSM, it’s easy to come to the conclusion that most academic statisticians couldn’t care less about the communication of results.

In business, analyses are often not done just once, but need to be performed again and again as new data come in. These analyses need to be robust in both the statistical sense (i.e. to changes in the underlying distributions/assumptions) and in the software engineering sense (i.e. to changes in the underlying technological infrastructure).

Statistics is a part of data science, not the whole thing. Attempting to claim that data science is “just” statistics makes statisticians look out of touch, and belittles the many contributions from outside statistics. Does this model resonate with you? Do you think this is what statistics really is about? Please let me know your thoughts at hadley@rstudio.com.