For many of you, this is your first time doing any serious coding, so
we’ll start with the fundamentals. This lecture covers what exploratory data analysis (EDA) is, why
it’s a crucial first step in any data analysis project, and how we can
use visualizations to understand the individual characteristics of our
variables.
What is Exploratory Data Analysis?
EDA isn’t about running complex models or finding final answers.
Think of it as data detective work. You’re generating
questions, looking for clues in your data through visualizations and
transformations, and then using what you find to ask even better
questions. It’s an iterative, creative process.
The main goal of EDA is to develop an understanding of your
data. You’re building a mental map of what’s in your dataset. Are
there missing values? Are there strange outliers? What’s the typical
range of values for each variable? You’re using your curiosity and
skepticism to guide this investigation.
To do this, we’ll focus on two key types of questions:
What type of variation occurs within my variables? (The topic for
today)
What type of covariation occurs between my variables? (The topic
for next time)
Let’s define a few key terms to keep us on the same page.
A variable is a quantity or quality you can
measure.
A value is a specific measurement of a
variable.
An observation is a set of measurements made
under similar conditions. It’s often a single row in your
dataset.
Tabular data is a dataset arranged in rows and
columns. We’ll be working with tidy data, where each variable is a
column, each observation is a row, and each value has its own cell. This
is the format that works best with the tidyverse tools we’ll be
using.
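To make the tidy layout concrete, here is a minimal sketch using a tiny made-up table. The `employees` tibble and its column names are hypothetical and are not part of the attrition dataset; they only illustrate "one variable per column, one observation per row".

```r
library(tibble)  # already attached if you have run library(tidyverse)

# Hypothetical toy data, invented for illustration only
employees <- tibble(
  employee_id    = c(1, 2, 3),
  department     = c("Sales", "Research", "Sales"),
  monthly_income = c(4200, 6100, 3900)
)

employees        # each row is one observation (one employee)
nrow(employees)  # number of observations
ncol(employees)  # number of variables
```

Each cell holds exactly one value, for example the monthly income of employee 2.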
Understanding Variation in Your Data
Variation is the tendency for a variable’s values to change from one
observation to the next. Every variable has a unique pattern of
variation, and the best way to understand this pattern is to visualize
the distribution of its values.
We’ll use a new dataset today that’s more relevant to business: a
fictional employee attrition dataset. Let’s load the tidyverse and the
data.
library(tidyverse)
attrition_data <- read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv")
# Convert character columns to factors
attrition_data <- attrition_data %>% mutate(across(where(is.character), as.factor))
Visualizing Distributions of Different Variable Types
The plot you use depends on the variable type.
Categorical Variables
A categorical variable can only take on a small set of values. In R,
these are often character vectors or factors.
To visualize the distribution of a categorical variable, we use a
bar chart (geom_bar). The height of each bar represents the count of
observations for that category. Let’s look at the JobRole variable in
our dataset.
ggplot(data = attrition_data) +
  geom_bar(mapping = aes(x = JobRole))

This plot is a bit cluttered because of the long names. Let’s make it
more readable by flipping the coordinates.
ggplot(data = attrition_data) +
  geom_bar(mapping = aes(x = JobRole)) +
  coord_flip()

You can manually get these counts using dplyr::count():
attrition_data %>%
  count(JobRole)
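If you also want the categories ranked and expressed as shares, one possible sketch is below. It uses a small made-up `toy_roles` tibble (an assumption, so the example runs on its own); with the real data you would start from attrition_data instead.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for attrition_data
toy_roles <- tibble(
  JobRole = c("Manager", "Sales Executive", "Sales Executive",
              "Research Scientist", "Sales Executive", "Research Scientist")
)

role_counts <- toy_roles %>%
  count(JobRole, sort = TRUE) %>%  # most common category first
  mutate(prop = n / sum(n))        # share of all observations

role_counts
```

Sorting puts the tallest bars of the bar chart at the top of the table, which makes the typical values easy to read off.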
Continuous Variables
A continuous variable can take on any value within an interval. Think
of things like salary, age, or employee tenure.
To visualize the distribution of a continuous variable, we use a
histogram (geom_histogram). A histogram divides the data range into
bins and then counts how many observations fall into each bin. Let’s
look at the MonthlyIncome of our employees.
ggplot(data = attrition_data) +
  geom_histogram(mapping = aes(x = MonthlyIncome), binwidth = 1000)

The binwidth argument is crucial. Changing it can reveal different
patterns in your data. It’s always a good idea to try a few different
binwidth values.
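To see what a histogram is counting, you can compute the bins by hand. The sketch below combines dplyr::count() with ggplot2::cut_width() on a made-up `toy_income` tibble (hypothetical values, not the real data). The choice of `boundary = 0` is an assumption made so that bin edges fall on round numbers.

```r
library(dplyr)
library(tibble)
library(ggplot2)  # provides cut_width()

# Hypothetical toy data standing in for attrition_data
toy_income <- tibble(
  MonthlyIncome = c(1200, 1800, 2300, 2600, 2900, 5100, 9800)
)

income_bins <- toy_income %>%
  # Cut the income range into bins that are 1000 wide, with edges at multiples of 1000
  count(bin = cut_width(MonthlyIncome, width = 1000, boundary = 0))

income_bins
```

Each row is one non-empty bin and n is the height of the corresponding bar. Trying a different `width` here is the tabular equivalent of changing binwidth in the plot.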
Interpreting Visualizations: Typical & Unusual Values
Once you have your visualizations, the real detective work begins.
We’re looking for patterns, common values, and anything that stands
out.
Finding Typical Values
Tall bars in a bar chart or a high frequency in a histogram show us
the most common values. For example, in our JobRole bar chart,
Sales Executive and Research Scientist appear to be the most common
roles.
Clusters of values in a histogram can suggest
there are underlying subgroups in your data. You might
have a group of new hires with low TotalWorkingYears and a
group of senior employees with many years of experience. We’ll explore
this more next week.
Identifying Unusual Values (Outliers)
Outliers are observations that don’t fit the general
pattern. They could be data entry errors (like a salary of $0) or they
could be real, interesting observations (a CEO with a massive
salary).
Outliers can be hard to see in histograms, especially if there are a
lot of data points. Let’s look at the DailyRate variable.
ggplot(data = attrition_data) +
  geom_histogram(mapping = aes(x = DailyRate))

The plot looks okay. But what if there’s a typo, say an employee with
a DailyRate of 99999? Let’s add that to the data and see what
happens.
attrition_data_outlier <- attrition_data %>%
  add_row(DailyRate = 99999)
ggplot(data = attrition_data_outlier) +
  geom_histogram(mapping = aes(x = DailyRate), binwidth = 100)

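The one outlier makes the rest of the plot unreadable! It stretches
the x-axis all the way out to 99999, so the bars for the typical values
are squeezed into a thin sliver at the left edge, and the outlier’s own
bar (a count of 1) is too short to see.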
This is where coord_cartesian() comes in handy. It lets
us zoom in without throwing out the data.
ggplot(data = attrition_data_outlier) +
  geom_histogram(mapping = aes(x = DailyRate), binwidth = 100) +
  # Zoom to the plausible range so low-but-valid rates stay visible
  coord_cartesian(xlim = c(100, 1600))

This reveals the typical distribution again while still acknowledging
that the outlier exists.
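Once a plot hints at unusual values, it usually helps to pull out the actual rows and look at them. The sketch below does this with dplyr::filter() on a made-up `toy_rates` tibble. Both the data and the 100 to 1600 "plausible range" are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for attrition_data_outlier
toy_rates <- tibble(
  EmployeeNumber = 1:6,
  DailyRate      = c(350, 820, 1100, 99999, 640, 1480)
)

unusual <- toy_rates %>%
  filter(DailyRate < 100 | DailyRate > 1600) %>%  # outside the assumed plausible range
  arrange(desc(DailyRate))                        # most extreme value first

unusual
```

Seeing the full row often tells you whether the value is a typo or a genuine but rare observation.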
Handling Outliers and Missing Values
When you find an outlier, the first step is to investigate it. Is it
a mistake? If so, you might want to replace it with a missing value (NA)
using ifelse() or case_when().
Let’s imagine we know that any DailyRate below 100 or
above 1600 is a data error.
attrition_data_cleaned <- attrition_data_outlier %>%
  mutate(DailyRate = ifelse(DailyRate < 100 | DailyRate > 1600, NA, DailyRate))
ggplot(data = attrition_data_cleaned) +
  geom_histogram(mapping = aes(x = DailyRate))

Notice how ggplot2 automatically gives us a warning that
it removed the rows with missing values. To get rid of that warning, you
can add na.rm = TRUE to geom_histogram().
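The same replacement can be written with case_when(), and it is worth checking how many values ended up missing. This is a minimal sketch on a made-up `toy_rates` tibble; the data and the 100 and 1600 cutoffs are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with two implausible values (50 and 99999)
toy_rates <- tibble(DailyRate = c(350, 820, 50, 99999, 640, 1480))

cleaned <- toy_rates %>%
  mutate(DailyRate = case_when(
    DailyRate < 100  ~ NA_real_,  # too low: treat as a data error
    DailyRate > 1600 ~ NA_real_,  # too high: treat as a data error
    TRUE             ~ DailyRate  # otherwise keep the value
  ))

na_summary <- cleaned %>%
  summarise(
    n_missing = sum(is.na(DailyRate)),       # how many values became NA
    mean_rate = mean(DailyRate, na.rm = TRUE)  # summaries need na.rm = TRUE too
  )

na_summary
```

Without na.rm = TRUE, mean() would return NA as soon as a single value is missing, so the same argument matters for summaries as well as plots.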
