In this lesson, we will discuss several analysis techniques that are used to explore a particular dataset. This process, called “exploratory data analysis” (or EDA for short), provides the analyst with a general understanding of the data, including what the dataset consists of, what types of variables are present, the distribution of the variables, and how certain variables may be related to each other. This process should always occur before jumping in and simply trying to build models, as a significant amount of insight can be obtained which should inform what kinds of models may be the most effective at solving the problem.
R4DS describes EDA as an iterative cycle in which you:
Generate questions about your data.
Search for answers by visualizing, transforming, and modeling your data.
Use what you learn to refine your questions and/or generate new questions.
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data.
In Chapter 3, Shah describes several different types of analysis techniques, but at this time we are only going to consider the following:
Descriptive Analysis (Section 3.3): Used to describe historical or present conditions based on a collection of data.
Diagnostic Analysis (Section 3.4): Used for discovery or to determine why something happened. Correlation analysis is one of the most common techniques in this category.
Exploratory Analysis (Section 3.7): Refers to the aforementioned EDA process and description.
In order to prepare for the hands-on portion of this lesson, let’s load the tidyverse now. If you are not knitting this markdown file at this time, make sure you click the green arrow in order to run the line of code.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Recall from our discussion about data the following concepts and definitions:
Variable: A measurable property or attribute, such as the employment status and age of each adult in the U.S., or the sales revenue of each franchise in a fast-food chain
Measurement: The process of assigning a value to a variable
Discrete Variable: Can only assume a countable number of values, such as the numbers on the faces of a die or the number of children in a family
Continuous Variable: Can assume an infinite number of values without gaps or jumps, such as the amount of time between machine failures or the daily temperature
Measurement Scales: The levels at which variables are measured, typically classified as nominal, ordinal, interval, or ratio
Outcome or Response Variable: The object of your study (for example, a person’s weight); this is also sometimes called the dependent variable
Predictor or Explanatory Variable: The things that are (or may be) related to the object of your study (for example, food intake, level of exercise, genetic factors, prescription drugs taken, amount of stress, marital status, annual salary, etc.); this is also sometimes called an independent variable
A variable is so named because the values it takes on will vary from one observation to another. If I were to ask each student in the class to provide their age, we would not observe the same value for every single person. Instead, we would see that there is variation among the reported ages. The degree of variation for a given variable is quite important to understand, so we will explore this in depth in the next few sections.
Every variable has its own pattern of variation, which can reveal interesting information, and the best way to understand that pattern is to visualize the distribution of the variable’s values. A frequency distribution is a summary of how often each value of a variable occurs, and plotting it is a quick way to evaluate the degree of variation that exists. How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. Generally speaking, the following are the most common types of plots you will encounter:
Histogram: Used to illustrate the distribution of a single quantitative (numeric) variable
Bar Chart or Bar Plot: Used to illustrate the distribution (i.e., counts) of a categorical variable
Boxplot: Used to illustrate the distribution of a single quantitative variable across the levels of a single categorical variable
Scatterplot: Used to illustrate the relationship between two quantitative variables
We will explore each of these in the following sections.
Note that Shah also discusses pie charts on page 73. If a categorical variable has more than two levels, pie charts can be extremely difficult to evaluate as the human eye cannot easily distinguish between slices of a circular area. For this reason, I would encourage you to refrain from using them as much as possible (but unfortunately business executives love them, so you may have to create them from time to time).
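Should you ever be asked to produce one anyway, here is a minimal sketch (not part of the Shah exercises) of how a pie chart can be built in ggplot by stacking a single bar and switching to polar coordinates. It uses the diamonds dataset that ships with ggplot2, since the tidyverse has already been loaded above:
ggplot(diamonds, aes(x = "", fill = cut)) +
  geom_bar(width = 1) +          # stack all observations into one bar
  coord_polar(theta = "y") +     # wrap the stacked bar into a circle
  labs(x = NULL, y = NULL)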
In Shah 3.3.2, a dataset is presented that includes information about the productivity of a group of data science professionals as it relates to years of experience and whether the person went through extensive statistics training. Before we continue, take a moment to answer the questions in Try It Yourself 3.1 on page 72.
As previously mentioned, histograms are used to illustrate the distribution of a single quantitative (numeric) variable. In this section, we will use the Productivity dataset described in Table 3.1 on page 71 of the Shah book.
The dataset is contained within this project (called ‘Table_3.1.csv’), so let’s import the data into the environment and name it ‘productivity’:
productivity <- read_csv("Table_3.1.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Productivity = col_double(),
## Experience = col_double(),
## Training = col_character()
## )
Let’s also quickly review the results of the import process by viewing the first 10 rows of the data:
productivity
## # A tibble: 30 x 3
## Productivity Experience Training
## <dbl> <dbl> <chr>
## 1 5 1 Y
## 2 2 0 N
## 3 10 10 Y
## 4 4 5 Y
## 5 6 5 Y
## 6 12 15 Y
## 7 5 10 Y
## 8 6 2 Y
## 9 4 4 Y
## 10 3 5 N
## # … with 20 more rows
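If you would like a more compact view of the import, one option (a quick sketch, not required for the examples that follow) is dplyr’s glimpse() function, which prints one row per column along with each column’s type and its first few values:
glimpse(productivity)  # compact overview of column names, types, and values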
Hands-On Example 3.1 on page 72 illustrates a histogram for the ‘productivity’ variable. See if you can reproduce it using ggplot, and then interpret what you see. Hint: set the number of bins to 7.
ggplot(data = productivity, aes(x = Productivity)) +
geom_histogram(bins = 7)
Note that the histogram produced here looks quite different from what appears in the book. This is because the automatically selected “break points” (that is, the “boundaries” of each bin) do not necessarily fall on integers, which can render bins that have rather peculiar starting and ending points. Also notice that the borders of each bin are not shown, sometimes making it more difficult to identify each bin. However, these are all issues that can be addressed with a few extra lines of code.
While ggplot is typically the preferred approach for creating graphics in R (especially production-ready graphics), the trade-off is that it usually requires more code to produce the desired effect. Base R comes with several graphics functions that are a bit more crude but can be quicker to use during the EDA process. As an example, note the following chunk, which uses Base R’s hist() function and takes the single argument dataset_name$variable_name:
hist(productivity$Productivity)
Base R graphic functions can be useful for a quick understanding, but I still encourage you to become proficient with ggplot() as this will benefit you in the long run (and in the short run for your project submissions!). As a comparison, here is the ggplot code for reproducing the plot in the book as closely as possible. You will not need to write this much code every time you create a plot using ggplot; I instead provide this as an illustration of how involved they can become if you decide to tweak every little element possible.
ggplot(data = productivity, aes(x = Productivity)) +
geom_histogram(breaks = c(2, 4, 6, 8, 10, 12, 14, 16), color = "black", fill = "steelblue") +
labs(x = "Productivity", y = "Frequency", title = "Histogram of productivity") +
theme_bw() +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
plot.title = element_text(hjust = 0.5))
Exercise
Use R to complete Try It Yourself 3.2 from page 73 of Shah. Remember that you will first need to bring the dataset into your environment (name it ‘pizza’). Also, rename ‘X’ to be ‘fee’ and ‘Y’ to be ‘startup_cost’. Set the number of bins in the histogram to 10. Report on what you see.
pizza <- read_csv("OA 3.1 - pizza.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## X = col_double(),
## Y = col_double()
## )
pizza <- rename(pizza, fee = X, startup_cost = Y)
ggplot(data = pizza, aes(x = startup_cost)) +
geom_histogram(bins = 10)
In an ideal world, histograms would reveal that the data is distributed symmetrically around the center. This so-called normal distribution is characterized by the familiar bell-shaped curve. However, in reality, data is often asymmetrically distributed, a condition called skewness that is important to recognize when it is present. Fortunately, histograms are extremely useful for highlighting it.
There are two types of skewness:
Negative Skewness: The tail of the distribution is on the left (negative) side, and the mean is less than the median.
Positive Skewness: The tail of the distribution is on the right (positive) side, and the mean is greater than the median. Positive skewness is much more common than negative skewness, as it typically occurs for variables with a natural lower bound of zero (i.e., variables for which negative values do not make sense), which is quite common.
The following graphic is helpful in understanding the two types of skewness:
Diagram of (a) a negatively skewed distribution, (b) a normal distribution, and (c) a positively skewed distribution.
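As a rough numerical check for skewness, you can compare a variable’s mean and median: a mean noticeably above the median suggests positive skew, and a mean below the median suggests negative skew. Here is a quick sketch using the productivity data imported earlier:
mean(productivity$Productivity)    # the mean is pulled toward the longer tail
median(productivity$Productivity)  # the median is resistant to the tail
For this variable the mean (about 7.27) exceeds the median (6.5), as shown in the summary output later in this lesson, hinting at a mild positive skew.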
As previously mentioned, bar charts (sometimes called bar plots) are used to illustrate the distribution (i.e., counts) of a categorical variable. While they appear similar to histograms, the “bins” here are the individual levels of the categorical variable rather than ranges of a numeric variable. This kind of visualization is useful for determining how many observations are associated with each level of a categorical variable.
As an example, let’s consider the Training variable (which is categorical) from the productivity data. We can use ggplot and the geom_bar geom to create the plot as follows:
ggplot(productivity, aes(x = Training)) +
geom_bar()
Notice that each bar represents one of the two levels (N or Y) of the Training variable, and the height of the bar shows the number of observations for each level. This is essentially a replication of Figure 3.8 on page 77.
Another example uses the diamonds dataset from the tidyverse. This example comes from Section 7.5.1 of R4DS online (or page 94 of the printed book). Notice the slight variation in the code; that is, you can also specify the aesthetics within geom_bar() (as opposed to within the first line of ggplot code as in the previous example).
ggplot(diamonds) +
geom_bar(mapping = aes(x = cut))
Often, a single number can tell us enough about a distribution. This is typically a number that points to the “center” of a distribution. In other words, we can calculate where the “center” of a frequency distribution lies, which is also known as the location of central tendency. We put “center” in quotes because it depends on how it is defined. There are three measures commonly used:
Mean: The average of all values, calculated as the sum of all values divided by the number of values, as shown in Equation 3.1 on page 76:
\[\overline{x} = \frac{x_1 + x_2 + x_3 + ... + x_n}{n} = \frac{\sum x}{n}\]
There is a significant drawback to using the mean as a measure of central tendency: it is susceptible to the influence of outliers (a short demonstration follows this list). Relatedly, it is most informative when the data is normally distributed or at least approximately so.
Median: The middle value of a variable that has been sorted from low to high (for an odd number of observations), or the average of the middle two data points (for an even number of observations). The median is also known as the 50th percentile (more on percentiles in a moment). Medians are more relevant than means when the distribution is highly skewed in one direction or the other, and/or when there are significant outliers present. This explains why, for example, home prices are typically listed as medians instead of means: if there are one or two mansions in a dataset containing mostly standard-size houses, the mean would be greatly inflated by the prices of these larger homes.
Mode: The most frequently occurring value in a dataset. Practically speaking, this value does not have nearly as much use as the mean or median.
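Here is a short sketch of how each of these measures can be computed in R for the Productivity variable. Note that base R’s mode() function reports an object’s storage mode rather than the statistical mode, so the mode below is computed with a small table-based workaround; the final lines illustrate the outlier sensitivity of the mean mentioned above (the value 100 is an artificial outlier added purely for illustration):
mean(productivity$Productivity)    # arithmetic mean
median(productivity$Productivity)  # middle value (50th percentile)

# Statistical mode: tabulate the values and take the most frequent one
counts <- table(productivity$Productivity)
as.numeric(names(counts)[which.max(counts)])

# An artificial outlier inflates the mean but barely moves the median
with_outlier <- c(productivity$Productivity, 100)
mean(with_outlier)
median(with_outlier)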
Simply looking at the central point (e.g., the mean or median) of a distribution doesn’t tell the whole story. It is also important to understand the dispersion of values across the distribution. There are both numerical and graphical ways to specify this. We’ve already seen how a histogram can be a helpful visual aid, but we’ll explore an alternative visual momentarily. For now, we will focus on numerical methods.
The following are some of the most common quantities for measures of dispersion:
Range: The value resulting from subtracting the smallest value in the distribution from the largest. This gives you an idea of the overall “span” of the variable.
Interquartile Range (or IQR): The range spanned by the middle 50% of the data, from the 25th percentile to the 75th percentile. This is used to help minimize the influence of significant outliers that may lie at either end of the distribution.
In R, there are several ways to find these quantities, but perhaps the easiest is to use the summary() function from base R. It can either be used on an entire dataset or on a single variable through the use of the dataset_name$variable_name syntax we saw previously with the hist() function. Let’s use this function on the entire productivity dataset:
summary(productivity)
## Productivity Experience Training
## Min. : 2.000 Min. : 0.0 Length:30
## 1st Qu.: 5.000 1st Qu.: 5.0 Class :character
## Median : 6.500 Median : 9.5 Mode :character
## Mean : 7.267 Mean :10.0
## 3rd Qu.: 9.000 3rd Qu.:15.0
## Max. :15.000 Max. :20.0
Notice that for the Productivity variable, the range is 15 - 2 = 13, and the IQR is 9 - 5 = 4 (the Shah book incorrectly lists this value as 5 on page 79). Likewise, the range for the Experience variable is 20, while the IQR is 10.
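If you would rather compute these two quantities directly than read them off the summary, a quick sketch using base R follows (range() returns the minimum and maximum, so we take the difference to get the range):
diff(range(productivity$Productivity))  # range: 15 - 2 = 13
IQR(productivity$Productivity)          # interquartile range: 9 - 5 = 4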
Variance: A measure used to indicate how spread out the data points are in relation to the mean (technically, it is the average of the squared differences from the mean). If the individual observations vary greatly from the group mean, the variance is large; and vice versa. It is also important to distinguish between the variance of a population and the variance of a sample. They have different notations and are computed differently. The variance of a population is notated as \(\sigma^{2}\), while the variance of a sample is notated as \(s^{2}\). See page 80 of the Shah book for the calculation formulas.
Standard Deviation: This is simply the square root of the variance, which ensures that the measure of average spread is in the same units as the original measurement. Just as with variance, a larger SD indicates a larger spread in the data, and vice versa.
In R, the sample variance and sample standard deviation can be calculated using the var() and sd() functions, respectively.
var(productivity$Productivity) # variance
## [1] 11.92644
sd(productivity$Productivity) # standard deviation
## [1] 3.453467
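To connect these functions back to the sample formulas on page 80, here is a brief sketch that computes the sample variance “by hand” (the sum of squared deviations from the mean divided by n - 1) and confirms that the standard deviation is simply its square root:
x <- productivity$Productivity
sum((x - mean(x))^2) / (length(x) - 1)  # manual sample variance; matches var(x)
sqrt(var(x))                            # square root of the variance; matches sd(x)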
If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variables involved.
A common method for displaying the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:
A box that spans the interquartile range (IQR). In the middle of the box is a line that displays the median (i.e., the 50th percentile) of the distribution. These three lines (the two edges of the box and the median) give you a sense of the spread of the distribution and whether it is symmetric about the median or skewed to one side.
Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual, so they are plotted individually (a short sketch of this calculation appears after the figure below).
A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
The following graphic demonstrates the relationship between a plot of the actual values in a distribution, a rotated histogram of the distribution, and a boxplot of the distribution:
Comparison of the actual values in a distribution (left) with a histogram (middle) and a boxplot (right) of the distribution.
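As a quick sketch of the 1.5 × IQR rule described above, the “fences” beyond which observations are plotted as individual points can be computed directly from the quartiles of the Productivity variable (note that geom_boxplot() computes its hinges internally, so this is only an approximation of what ggplot does):
q1 <- unname(quantile(productivity$Productivity, 0.25))
q3 <- unname(quantile(productivity$Productivity, 0.75))
iqr <- q3 - q1

# Any value below the lower fence or above the upper fence would be
# drawn as an individual (outlier) point on the boxplot.
c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)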
Boxplots are informative as a graphic for a single numeric variable, but really become useful when combined with a categorical variable. As an example, let’s compare the distribution of productivity for both those without (left) and with (right) extensive statistics training. To create this, we will use geom_boxplot() within ggplot:
ggplot(productivity, aes(x=Training, y=Productivity)) +
geom_boxplot()
What do you observe from this? What conclusion might you make about the effect of statistics training on the productivity of a data scientist?
Traditionally, boxplots have “whiskers” with a small crossbar at each end (hence their sometimes being called “box-and-whisker” plots), but ggplot does not draw those end caps automatically. You can rectify this by adding a line of code before geom_boxplot() in which the stat_boxplot() function is used to add “errorbars” (and adjust their width if desired):
ggplot(productivity, aes(x=Training, y=Productivity)) +
stat_boxplot(geom='errorbar', width=0.3) +
geom_boxplot()
Alternatively, for a quick-and-dirty boxplot, you can use base R. Notice here that the syntax is (y variable ~ x variable). The tilde (~) is typically read as “as described by” or “as modeled by”. We will see this syntax again in regression modeling.
boxplot(productivity$Productivity~productivity$Training)
To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination. One way to do that is to rely on the built-in geom_count():
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.
Another approach is to compute the count with dplyr:
diamonds %>%
count(color, cut)
## # A tibble: 35 x 3
## color cut n
## <ord> <ord> <int>
## 1 D Fair 163
## 2 D Good 662
## 3 D Very Good 1513
## 4 D Premium 1603
## 5 D Ideal 2834
## 6 E Fair 224
## 7 E Good 933
## 8 E Very Good 2400
## 9 E Premium 2337
## 10 E Ideal 3903
## # … with 25 more rows
Then visualize with geom_tile() and the fill aesthetic:
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = cut, y = color)) +
geom_tile(mapping = aes(fill = n))
The way to interpret this plot is that the darker colors represent color and cut combinations that have fewer examples (i.e., counts) in the data, whereas lighter colors represent combinations with a greater number of examples. In other words, the most common combination in this dataset is an ideal cut with a color of G (there are 4,884 such examples in the data).
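If you want to pull that most common combination out of the data directly rather than reading it off the plot, one quick approach (a sketch, not from the book) is to sort the counts:
diamonds %>%
  count(color, cut) %>%
  arrange(desc(n)) %>%  # most frequent combination first
  head(1)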
If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots. We will not discuss these here.
The most common method for exploring the relationship between two continuous variables is correlation. This is a statistical analysis that is used to measure and describe the strength and direction of the linear relationship between the two variables. Strength indicates how closely two variables are related to each other, while direction indicates how the value of one variable would change as the other variable changes.
Correlation is measured on a scale of -1 to +1, where -1 signifies a perfect negative linear relationship (every point falls exactly on a downward-sloping line) and +1 signifies a perfect positive linear relationship (every point falls exactly on an upward-sloping line). A value of zero indicates no linear relationship at all between the two variables. While there is no hard and fast rule, generally speaking a value less than or equal to |.20| is considered a weak relationship, a value between |.20| and |.50| is considered a moderate relationship, and anything greater than |.50| is considered a strong relationship. The calculation for the statistic, which is notated as \(r\), is shown on page 82 of Shah.
In R, the simplest way to calculate correlation is through the cor() function. However, because this metric is only relevant for numeric variables, you have to make sure that only numeric columns are passed to the function. In the example below, we are calculating the correlation between the Productivity and Experience variables, but need to exclude the Training variable from the function (since it is categorical).
There are a few ways to accomplish this using the select function from dplyr. We can either 1) specify the column numbers (1 and 2), 2) specify the columns by name (Productivity and Experience), or 3) use the select_if function combined with the is.numeric argument to select only those columns that are numeric. This latter option is quite handy, but keep in mind that it will also retain irrelevant numeric values (such as a year or a numeric ID number) in the results. Following is a demonstration of this third option:
cor(select_if(productivity, is.numeric))
## Productivity Experience
## Productivity 1.0000000 0.7838199
## Experience 0.7838199 1.0000000
Once we run the code, we notice that the value of the correlation coefficient is 0.784, indicating a strong positive relationship between Productivity and Experience. Practically speaking, how would you interpret this relationship? What does it imply (and NOT imply)?
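To tie this value back to the formula on page 82, here is a brief sketch computing the same coefficient as the covariance of the two variables divided by the product of their standard deviations:
with(productivity,
     cov(Experience, Productivity) / (sd(Experience) * sd(Productivity)))  # same value as cor()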
The plot used to explore the relationship between two continuous variables is the scatterplot. Scatterplots are incredibly useful, and will play a special role when we discuss regression modeling. To create a scatterplot using ggplot, we will make use of geom_point():
ggplot(productivity, aes(x = Experience, y = Productivity)) +
geom_point()
What do you notice? How would you interpret this plot?
To further enhance the plot, a line can be displayed that illustrates the relationship more clearly:
ggplot(productivity, aes(x = Experience, y = Productivity)) +
geom_point() +
geom_smooth(method = "lm", col = "blue", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
While ggplot is useful for simple scatterplots such as this, we will be using some different packages and functions to display scatterplots for cases where there are more than two numeric variables to explore (which is often the case).
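As a quick preview, and only as a sketch of one possible option, base R’s pairs() function will draw a scatterplot matrix for every pair of numeric columns at once:
# With only two numeric columns this simply shows the same pair twice,
# but the same call scales to datasets with many numeric variables.
pairs(select_if(productivity, is.numeric))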