Introduction

Exploratory data analysis (EDA), as you might expect, is the process of exploring your data. Normally, this process includes examining the structure and components of the dataset, the distribution of the variables, and any relationships that may exist between pairs of variables. The idea of an EDA is to summarize the main characteristics of the data. Data visualization is one of the most effective tools for an exploratory data analysis, as deep insights can come to light quickly with minimal effort.

There are several goals of exploratory data analysis:

1. Determine if there are any problems with your dataset.
2. Determine whether your question can be answered with your data.
3. Develop a sketch of the answer to your question.
4. Assess assumptions on which the statistical inference will be based.
5. Select techniques that may be appropriate for future analysis.

At the end of an EDA, you should have a sense of what the answer to your question is and have sufficient information to move forward with the full analysis. Remember, the epicycle of analysis applies here. If your data cannot help answer your question, you need to find new data.

The Checklist

While there is no exact list of things that need to be done for an exploratory data analysis, certain ideas come up quite frequently. Below, you will find some of the steps you may wish to follow to complete your EDA.

1. Formulate your question
2. Read in your data
3. Look at the top and the bottom of your data
4. Check your “NA”s
5. Validate with at least one external data source
6. Make a plot
7. Try the easy solution first
8. Follow up

Formulate The Question

Above all else, a proper question is of the utmost importance. Clients rarely give you the correct question at the beginning. Thus, prying the correct question from them is an important first step in the process. A sharp question or hypothesis can serve as a dimension reduction tool that can eliminate variables that are not immediately relevant to the question.

As an example, a bank might ask: “How can we make higher profits?” The answer could be as simple as changing the price of certain products. Alternatively, the bank could want to generate more profits by attracting additional clients or reducing its staff. Perhaps the bank's current clients are buying certain products from a competitor rather than from the bank, and the bank wants to know why they are leaving. All of these questions fall under the umbrella of making higher profits. A sharp question will focus your analysis. At this point, you need to determine whether you have the right data to answer it.

Read The Data

The next task in any exploratory data analysis is to read in the data. If the data is messy, it will require some cleaning. Other times, the cleaning will be done before you receive the data. Basic cleaning can be done in Excel. Otherwise, make use of the dplyr package.
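The kind of basic cleaning described above can be sketched in a few lines. The text points to Excel or the dplyr package for this step; the example below is only a rough illustration in plain Python, and the sample data, the column names, and the clean_row helper are all made up for the purpose.

```python
import csv
import io

# Made-up messy input: padded headers, stray whitespace, inconsistent case.
RAW = """ Name , Age ,City
 alice ,34, Toronto
BOB, 29 ,toronto
"""

def clean_row(row):
    """Trim whitespace in keys and values, then normalize case and types."""
    row = {k.strip().lower(): v.strip() for k, v in row.items()}
    row["name"] = row["name"].title()
    row["city"] = row["city"].title()
    row["age"] = int(row["age"])  # fails loudly if an age is not a whole number
    return row

# io.StringIO stands in for an open file so the sketch is self-contained.
cleaned = [clean_row(r) for r in csv.DictReader(io.StringIO(RAW))]
print(cleaned)
```

Converting types during cleaning has a side benefit: a value that cannot be converted raises an error immediately instead of surfacing later in the analysis.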

Look At Your Data

It’s often useful to look at the “beginning” and “end” of a dataset. This lets you know whether the data were read in properly, whether things are properly formatted, and whether everything is there. If your data are time series data, then make sure the dates at the beginning and end of the dataset match what you expected. The head() and tail() commands are useful for this. Reading issues are often easiest to see when looking at the end of the data. Looking at the first few lines is helpful if you are reading data from Excel and extra columns have been added to the data.
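As a minimal sketch of this check, the example below mimics head() and tail() in plain Python. The head and tail functions here are hypothetical stand-ins written for this illustration, not the R commands themselves, and the sample data is made up.

```python
import csv
import io

# Made-up sample data standing in for a real file. The trailing "Total"
# row mimics a summary line that spreadsheets often append.
RAW = """date,value
2021-01-01,10
2021-01-02,12
2021-01-03,11
2021-01-04,13
Total,46
"""

def head(rows, n=3):
    """Return the first n rows."""
    return rows[:n]

def tail(rows, n=3):
    """Return the last n rows."""
    return rows[-n:] if n > 0 else []

rows = list(csv.DictReader(io.StringIO(RAW)))

print(head(rows, 2))  # confirms the first date is the one expected
print(tail(rows, 2))  # the stray "Total" row shows up here
```

In this made-up case the top of the data looks fine, and only the tail reveals that a summary row was read in as if it were an observation.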

Always Check for NA’s

In general, counting things is usually a good way to determine if something is wrong. In the simplest case, if you expect 1,000 observations and it turns out there are only 20, something went wrong. The same applies to missing values: counting the NAs in each variable tells you whether a field is complete enough to use. But there are other areas that you can check depending on your application. To do this properly, you need to identify some landmarks that can be used to check against your data. For example, if you are collecting data on people, such as in a survey or clinical trial, then you should know how many people there are in your study.
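A counting check along these lines might look like the sketch below. It assumes that missing values are encoded as empty fields or the literal string "NA"; the expected row count and the survey data are hypothetical.

```python
import csv
import io

# Made-up survey extract with two missing values.
RAW = """id,age,income
1,34,52000
2,NA,48000
3,29,
4,41,61000
"""

EXPECTED_ROWS = 4     # hypothetical landmark: number of enrolled participants
MISSING = {"", "NA"}  # assumption: how missing values are encoded in this file

rows = list(csv.DictReader(io.StringIO(RAW)))

# Check the number of observations against the landmark.
n_rows = len(rows)
print(f"rows read: {n_rows} (expected {EXPECTED_ROWS})")

# Count missing values per column.
na_counts = {
    col: sum(1 for r in rows if (r[col] or "").strip() in MISSING)
    for col in rows[0]
}
print(na_counts)
```

How missing values are encoded varies from file to file, so the MISSING set is the first thing to adapt when reusing a check like this.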

Validate With External Sources

Making sure your data matches something outside of the dataset is very important. It allows you to ensure that the measurements are roughly in line with what they should be. External validation can often be as simple as checking your data against a single number, such as a known average.
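Checking against a single known number can be sketched as follows. The measurements, the reference average, the 10% tolerance, and the roughly_matches helper are all assumptions made for illustration; a real check would use a published figure for your own variable.

```python
from statistics import mean

def roughly_matches(values, reference, rel_tol=0.10):
    """Return True if the mean of values is within rel_tol of reference."""
    return abs(mean(values) - reference) <= rel_tol * abs(reference)

# Made-up measurements and a hypothetical externally known average.
heights_cm = [168.0, 172.5, 181.0, 165.5, 177.0]
reference_mean_cm = 170.0

print(roughly_matches(heights_cm, reference_mean_cm))

# A unit mix-up (inches recorded instead of centimetres) fails the same check.
heights_in = [h / 2.54 for h in heights_cm]
print(roughly_matches(heights_in, reference_mean_cm))
```

The tolerance is deliberately loose: the point is to catch values that are wildly off, such as wrong units or a shifted decimal point, not to test for small differences.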

Make a Plot

Graphing the data is a great way to further check your data. It gives you some insights and it will help you identify if you have incorrect data. This is also where you start to put together your plan for the presentation of your findings.

There are two key reasons for making a plot of your data: creating expectations and checking deviations from expectations.

At the early stages of analysis, you may have little sense of what is going on in the data. Even if you have tidied your data, a large enough dataset will be difficult to take in simply by looking at all of it. Making a plot serves as a summary and is a useful tool for setting expectations for what the data should look like.
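To illustrate how even a crude plot summarizes data and exposes deviations from expectations, the sketch below draws a text histogram using only the Python standard library. In practice you would likely use a plotting library; the text_histogram function and the ages are made up for this example, and the function assumes whole-number values.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Return the lines of a text histogram, one line per non-empty bin."""
    # Map each value to the lower edge of its bin and count per bin.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return [
        f"{lo:>4}-{lo + bin_width - 1:<4} {'#' * counts[lo]}"
        for lo in sorted(counts)
    ]

# Made-up ages; 250 is a deliberate data-entry error.
ages = [23, 25, 31, 34, 35, 38, 41, 42, 47, 52, 250]

for line in text_histogram(ages, bin_width=10):
    print(line)
```

Most of the bars cluster between 20 and 59, which sets the expectation, and the lone bar at 250 stands out as a deviation worth investigating.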

Follow-up Questions

  1. Do you have the right data? Sometimes, at the end of an exploratory data analysis, the conclusion is that the dataset is not really appropriate for the question.
  2. Do you need other data?
  3. Do you have the right question?

The goal of exploratory data analysis is to get you thinking about your data and reasoning about your question. At this point, we can refine our question or collect new data, all in an iterative process to get at the truth.

Citations

“7 Exploratory Data Analysis | R for Data Science.” Accessed June 29, 2021. https://r4ds.had.co.nz/exploratory-data-analysis.html.

Patil, Prasad. “What Is Exploratory Data Analysis?” Medium, May 23, 2018. https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15.

Tukey, John. Exploratory Data Analysis. Addison-Wesley, 1977.