Epicycle of Data Analysis

Data analysis does not follow a linear process. It is highly iterative and is better described as a set of cycles than as a sequence of steps. At each step, you learn something new, and that information is used to inform, refine and, where necessary, redo previous steps. It also helps determine what to do next.

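The five core pieces of a data analysis are: stating and refining the question, exploring the data, building a formal model, interpreting the results and communicating the results.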

At each stage of the process, it is useful to set expectations, collect information (data), compare that information with your expectations and, when the two disagree, revise either the expectations or the data so that they coincide. Often an analysis will begin with a client’s data set. These data may need to be cleaned, revised, augmented or otherwise changed before you can meet your goals. Do not be afraid to do this at every step of the analysis. This is the idea behind the epicycle of analysis: the same three-part loop is repeated as you refine your question, complete an exploratory analysis, construct a model, interpret the results and communicate those results to your client.

| | Set Expectations | Collect Information | Revise Expectations |
|---|---|---|---|
| Question | The question is of interest to the client | Literature review / ask questions | Refine question |
| Exploratory Analysis | Data are appropriate | Make exploratory plots | Refine question / collect more data |
| Formal Model | Primary model answers question | Secondary models / sensitivity analysis | Revise model |
| Interpretation | Interpretation of analysis provides specific and meaningful answers to client | Interpret the entirety of analysis with focus on effect sizes and uncertainty | Revise data analysis and/or provide specific and interpretable answer |
| Communicate | Process and results of analysis are understood by client | Elicit feedback | Revise analysis / approach / presentation |
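
One pass through the set/collect/revise loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ages are made-up values, and the expectation that a valid age lies between 0 and 120 is a hypothetical rule chosen for the example, not something taken from a real client data set.

```python
# Hypothetical data: ages reported in a client's survey (made up for illustration).
ages = [34, 41, 29, 250, 38, -1, 45]


def meets_expectation(age):
    """Step 1, set expectations: we assume a valid age lies between 0 and 120."""
    return 0 <= age <= 120


# Step 2, collect information: compare the data with the expectation.
unexpected = [age for age in ages if not meets_expectation(age)]

# Step 3, revise: here we decide the data are at fault and set those values aside.
# In another analysis the right move might be to revise the expectation instead.
cleaned = [age for age in ages if meets_expectation(age)]

print("Values that contradict the expectation:", unexpected)
print("Data after revision:", cleaned)
```

Whether to change the data or the expectation is a judgment call for the analyst; the code only makes the mismatch visible.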

The Question

“The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” This quotation from John Tukey is a reminder that even the best data scientist cannot solve every problem every time. Data science is difficult and, at times, frustrating. Sometimes you have the right data to solve a problem and sometimes you don’t. Some clients know exactly what they want and can communicate it clearly; others, the vast majority, cannot.

Types of Analysis

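The type of analysis you perform in a data science project depends heavily on the question being asked. Six types are described here: descriptive, exploratory, inferential, predictive, causal and mechanistic.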

A descriptive data analysis seeks to summarize the measurements in a single data set without attempting to interpret them. In this sort of analysis, the analyst may compute means and percentages and draw some basic visualizations.
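
As a minimal sketch of a descriptive summary, the Python snippet below computes a mean, a median and a percentage using only the standard library. The sales figures and the weekend flags are hypothetical values invented for this example.

```python
from statistics import mean, median

# Hypothetical data: one week of daily ice cream sales (units sold)
# and a flag saying whether each day fell on a weekend.
sales = [120, 135, 150, 90, 160, 210, 205]
is_weekend = [False, False, False, False, False, True, True]

mean_sales = mean(sales)
median_sales = median(sales)
pct_weekend = 100 * sum(is_weekend) / len(is_weekend)

print(f"Mean daily sales:   {mean_sales:.1f}")
print(f"Median daily sales: {median_sales}")
print(f"Weekend days:       {pct_weekend:.1f}%")
```

Nothing here is interpreted or generalized; the numbers only describe the data in hand.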

An exploratory data analysis focuses on finding trends or correlations between variables. For example, you may notice that sales of ice cream increase as the temperature rises. An exploratory analysis does not attempt to verify such a discovery. Follow-up studies or analyses are required to confirm the results of an exploratory analysis.
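
To make the ice cream example concrete, here is a small Python sketch that computes a Pearson correlation coefficient by hand. The temperature and sales values are hypothetical, and `pearson_r` is a helper written for this illustration rather than a library function.

```python
from math import sqrt

# Hypothetical data: daily high temperature (degrees Celsius) and ice cream sales.
temperature_c = [14, 18, 21, 25, 28, 31, 33]
ice_cream_sales = [90, 120, 135, 150, 160, 205, 210]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


r = pearson_r(temperature_c, ice_cream_sales)
print(f"Correlation between temperature and sales: {r:.2f}")
```

A value of r close to 1 suggests a strong positive linear trend in these particular data. It is a lead to follow up, not a confirmed finding.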

An inferential data analysis attempts to quantify whether a correlation or trend noticed in an exploratory analysis is likely to hold beyond the data set at hand. Analyses of this type are commonly found in scientific papers and make up a large share of health journal articles. Often we are unable to determine why a relationship exists; we can only say, with some stated uncertainty, that it does.
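
One way to ask whether an observed correlation could plausibly be due to chance is a permutation test, sketched below in Python. This is one possible technique among many, chosen here because it needs only the standard library. The data are the same hypothetical temperature and sales values used for illustration, and the number of permutations and the random seed are arbitrary assumptions.

```python
import random
from math import sqrt

# Hypothetical data: daily high temperature (degrees Celsius) and ice cream sales.
temperature_c = [14, 18, 21, 25, 28, 31, 33]
ice_cream_sales = [90, 120, 135, 150, 160, 205, 210]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


observed_r = pearson_r(temperature_c, ice_cream_sales)

# Shuffle the sales values many times. Shuffling breaks any real link with
# temperature, so the shuffled correlations show what chance alone produces.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_permutations = 2000
at_least_as_extreme = 0
shuffled = list(ice_cream_sales)
for _ in range(n_permutations):
    rng.shuffle(shuffled)
    if abs(pearson_r(temperature_c, shuffled)) >= abs(observed_r):
        at_least_as_extreme += 1

# Add one to numerator and denominator so the estimated p-value is never zero.
p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
print(f"Observed r = {observed_r:.2f}, permutation p-value = {p_value:.4f}")
```

A small p-value says that a correlation this strong would rarely arise if temperature and sales were unrelated. It does not say why the relationship exists.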

A predictive analysis uses a subset of measurements or features to predict another measurement or outcome for a single experimental unit (for example, a single individual). Based on a set of traits that we know about an individual, we might use a predictive analysis to determine whether they will like a particular book or movie. Here we are not interested in what causes them to like it; the focus is solely on predicting what their choice might be.
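
The book example can be sketched with a simple nearest-neighbour rule in Python. Everything here is hypothetical: the readers, the two features (age and books read per year), and the function `predict_likes` are invented for illustration, and a real analysis would normally rescale the features and evaluate the predictions on held-out data.

```python
# Hypothetical training data: (age, books read per year) for known readers,
# paired with whether each one liked a particular novel.
known_readers = [
    ((25, 30), True),
    ((31, 24), True),
    ((22, 28), True),
    ((58, 3), False),
    ((45, 5), False),
    ((63, 2), False),
]


def predict_likes(features, training, k=3):
    """Predict by majority vote among the k most similar known readers."""

    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = sorted(
        training, key=lambda item: squared_distance(item[0], features)
    )[:k]
    votes_for = sum(1 for _, liked in nearest if liked)
    return votes_for > k / 2


print(predict_likes((28, 26), known_readers))  # a young, frequent reader
print(predict_likes((60, 4), known_readers))   # an older, occasional reader
```

The rule predicts an outcome without offering any explanation of why a reader likes the book, which is exactly the distinction drawn above.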

In a causal analysis, the analyst is interested in determining how changes in one measurement impact a second measurement. For example, if we increase the rate at which we water our gardens, what impact does that have on the size of our plants? Here, the important piece is to quantify, both in terms of magnitude and direction, how the change in one variable impacts the other.
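
The watering example can be illustrated with a simulated randomized experiment in Python. This sketch rests entirely on assumptions: the plant heights are simulated, not measured, and the true effect of 5 cm is built into the simulation so that the estimate can be compared against it.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Assumption built into the simulation: extra watering adds 5 cm of growth.
TRUE_EFFECT_CM = 5.0
N_PLANTS = 20

# Randomly assign half of the plants to receive extra water. Random assignment
# is what lets the difference in group means be read as a causal effect.
plant_ids = list(range(N_PLANTS))
rng.shuffle(plant_ids)
treated = set(plant_ids[: N_PLANTS // 2])

# Simulated heights: a 30 cm baseline with some plant-to-plant variation,
# plus the true effect for the plants that received extra water.
heights_cm = {}
for plant_id in range(N_PLANTS):
    height = 30.0 + rng.uniform(-1.0, 1.0)
    if plant_id in treated:
        height += TRUE_EFFECT_CM
    heights_cm[plant_id] = height

treated_mean = mean([heights_cm[p] for p in range(N_PLANTS) if p in treated])
control_mean = mean([heights_cm[p] for p in range(N_PLANTS) if p not in treated])
estimated_effect_cm = treated_mean - control_mean

print(f"Estimated effect of extra watering: {estimated_effect_cm:+.2f} cm")
```

The sign of the estimate gives the direction of the effect and its size gives the magnitude, averaged over the plants in the experiment.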

Finally, a mechanistic data analysis seeks to understand not only whether one variable affects another but exactly how that effect works. For example, we know that smoking increases your chances of cancer. It is not definite that any one smoker will develop cancer, but we do see a clear causal relationship, on average, between smoking and the probability that the smoker develops cancer. A mechanistic analysis goes a step further and seeks to explain the process by which the effect arises. These sorts of analyses are often found in engineering and design problems.

Citations

Leek, J. (2015). The Elements of Data Analytic Style: A Guide for People Who Want to Analyze Data. Leanpub.

Peng, Roger D. and Matsui, Elizabeth. (2015–2017). The Art of Data Science. Skybrude Consulting, LLC.

Peng, Roger D. Exploratory Data Analysis with R. Skybrude Consulting, LLC.