---
title: "Introduction to EDA, Variation, and Visualizing Distributions"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

For many of you, this is your first time doing any serious coding, so we'll start with the fundamentals. This lecture covers what exploratory data analysis (EDA) is, why it's a crucial first step in any data analysis project, and how we can use visualizations to understand the individual characteristics of our variables.

## What is Exploratory Data Analysis?
EDA isn't about running complex models or finding final answers. Think of it as _data detective work_. You're generating questions, looking for clues in your data through visualizations and transformations, and then using what you find to ask even better questions. It's an iterative, creative process.

The main goal of EDA is to _develop an understanding of your data_. You're building a mental map of what's in your dataset. Are there missing values? Are there strange outliers? What's the typical range of values for each variable? You're using your curiosity and skepticism to guide this investigation.

To do this, we'll focus on two key types of questions:

1. What type of variation occurs within my variables? (The topic for today)

2. What type of covariation occurs between my variables? (The topic for next time)

Let's define a few key terms to keep us on the same page.

 - A __variable__ is a quantity or quality you can measure.

 - A __value__ is a specific measurement of a variable.

 - An __observation__ is a set of measurements made under similar conditions. It's often a single row in your dataset.

 - __Tabular data__ is a dataset arranged in rows and columns. We'll be working with tidy data, where each variable is a column, each observation is a row, and each value has its own cell. This is the format that works best with the tidyverse tools we'll be using.
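
To make these terms concrete, here is a small sketch using a made-up tibble. The `employees` table and its columns are hypothetical and are not part of the attrition dataset. Each column is a variable, each row is an observation, and each cell holds one value.

```{r}
library(tibble)

# Hypothetical toy data, invented for illustration only
employees <- tibble(
  name   = c("Ana", "Ben", "Chloe"),
  age    = c(34, 45, 29),
  salary = c(5200, 7100, 4300)
)

names(employees)   # the variables (columns)
employees[2, ]     # one observation (row): every measurement for Ben
employees$age[2]   # one value (cell): Ben's age
```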

## Understanding Variation in Your Data
Variation is the tendency for a variable's values to change from one observation to the next. Every variable has a unique pattern of variation, and the best way to understand this pattern is to visualize the distribution of its values.

We'll use a new dataset today that's more relevant to business: a fictional employee attrition dataset. Let's load the tidyverse and the data.

```{r}
library(tidyverse)
attrition_data <- read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv")
# Convert character (chr) columns to factors so they are treated as categorical
attrition_data <- attrition_data %>%
  mutate(across(where(is.character), as.factor))
```
### Visualizing Distributions of Different Variable Types

The plot you use depends on the variable type.

#### Categorical Variables
A categorical variable can take only one of a small set of values. In R, these are usually stored as character vectors or factors.
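
As a rough sketch of the difference, the made-up vector below (hypothetical values, not taken from the dataset) is stored first as a character vector and then as a factor. A factor records the fixed set of categories a variable can take, which is why the loading code converts character columns with `as.factor`.

```{r}
# Hypothetical department labels, invented for illustration
dept_chr <- c("Sales", "HR", "Sales", "R&D", "HR", "Sales")
dept_fct <- as.factor(dept_chr)

class(dept_chr)    # "character"
class(dept_fct)    # "factor"
levels(dept_fct)   # the distinct categories stored by the factor
table(dept_fct)    # how many times each category appears
```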

To visualize the distribution of a categorical variable, we use a __bar chart__ (`geom_bar`). The height of each bar represents the count of observations for that category. Let's look at the `JobRole` variable in our dataset.

```{r}
ggplot(data = attrition_data) +
  geom_bar(mapping = aes(x = JobRole))
```

This plot is a bit cluttered because the long role names overlap along the x-axis. Let's make it more readable by flipping the coordinates so the names run down the y-axis instead.

```{r}
ggplot(data = attrition_data) +
  geom_bar(mapping = aes(x = JobRole)) +
  coord_flip()
```

You can manually get these counts using `dplyr::count()`:

```{r}
attrition_data %>%
  count(JobRole)
```
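
The counts are often easier to read when they are sorted and expressed as shares. Here is a minimal sketch using a small made-up `roles` tibble (hypothetical data) so that it runs on its own; the same calls should work with `attrition_data` and `JobRole`.

```{r}
library(tidyverse)

# Hypothetical job roles, invented for illustration
roles <- tibble(
  JobRole = c("Manager", "Sales Executive", "Sales Executive",
              "Research Scientist", "Sales Executive", "Research Scientist")
)

role_counts <- roles %>%
  count(JobRole, sort = TRUE) %>%   # most common category first
  mutate(prop = n / sum(n))         # share of all observations

role_counts
```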

#### Continuous Variables
A continuous variable can take on any value within an interval. Think of things like salary, age, or employee tenure.

To visualize the distribution of a continuous variable, we use a __histogram__ (`geom_histogram`). A histogram divides the data range into bins and then counts how many observations fall into each bin. Let's look at the `MonthlyIncome` of our employees.

```{r}
ggplot(data = attrition_data) +
  geom_histogram(mapping = aes(x = MonthlyIncome), binwidth = 1000)
```

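The `binwidth` argument is crucial. It sets the width of each bin in the units of the x variable (here, 1000 units of `MonthlyIncome`), and changing it can reveal different patterns in your data. It's always a good idea to try a few different `binwidth` values.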
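
One way to see what a histogram is doing is to compute the bins by hand. The sketch below combines `ggplot2::cut_width()` with `dplyr::count()` on a made-up `incomes` tibble (hypothetical values) and shows how the same data give a different summary under two bin widths.

```{r}
library(tidyverse)

# Hypothetical monthly incomes, invented for illustration
incomes <- tibble(
  MonthlyIncome = c(2100, 2400, 2900, 3100, 3300, 4800, 5200, 9500)
)

# Narrow bins: more detail, but some bins hold only one observation
narrow <- incomes %>%
  count(bin = cut_width(MonthlyIncome, width = 1000, boundary = 0))

# Wide bins: a smoother picture, but the fine structure is lost
wide <- incomes %>%
  count(bin = cut_width(MonthlyIncome, width = 5000, boundary = 0))

narrow
wide
```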

## Interpreting Visualizations: Typical & Unusual Values
Once you have your visualizations, the real detective work begins. We're looking for patterns, common values, and anything that stands out.

### Finding Typical Values
 - __Tall bars in a bar chart__ or a __high frequency in a histogram__ show us __the most common values__. For example, in our `JobRole` bar chart, `Sales Executive` and `Research Scientist` appear to be the most common roles.

 - __Clusters__ of values in a histogram can suggest there are underlying __subgroups__ in your data. You might have a group of new hires with low `TotalWorkingYears` and a group of senior employees with many years of experience. We'll explore this more next week.

### Identifying Unusual Values (Outliers)
__Outliers__ are observations that don't fit the general pattern. They could be data entry errors (like a salary of $0) or they could be real, interesting observations (a CEO with a massive salary).

Outliers can be hard to see in histograms, especially if there are a lot of data points. Let's look at the `DailyRate` variable.

```{r}
ggplot(data = attrition_data) +
  geom_histogram(mapping = aes(x = DailyRate), binwidth = 100)
```

The plot looks okay. But what if there's a typo, say an employee with a `DailyRate` of 99999? Let's add that to the data and see what happens.

```{r}
attrition_data_outlier <- attrition_data %>%
  add_row(DailyRate = 99999)

ggplot(data = attrition_data_outlier) +
  geom_histogram(mapping = aes(x = DailyRate), binwidth = 100)
```

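The one outlier makes the rest of the plot unreadable! The x-axis now has to stretch all the way to 99999, so the bars for the typical values are squeezed into a thin sliver at the left edge, and the bar for the outlier itself is too short to see.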

This is where `coord_cartesian()` comes in handy. It lets us zoom in without throwing out the data.

```{r}
ggplot(data = attrition_data_outlier) +
  geom_histogram(mapping = aes(x = DailyRate), binwidth = 100) +
  coord_cartesian(xlim = c(100, 1600))
```

This reveals the typical distribution again. The outlier is still in the data; it simply falls outside the visible window.
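
Zooming shows the bulk of the data, but it also helps to pull out the unusual rows and look at them directly. Here is a minimal sketch with a made-up `rates` tibble (hypothetical values, including a planted 99999), using `dplyr::filter()` with the same upper cutoff as the zoomed plot.

```{r}
library(tidyverse)

# Hypothetical daily rates with one planted typo, invented for illustration
rates <- tibble(
  EmployeeNumber = 1:6,
  DailyRate      = c(450, 1120, 800, 99999, 1373, 279)
)

summary(rates$DailyRate)   # a maximum far above the median is a red flag

# List the rows that fall above the range we zoomed in on
unusual <- rates %>%
  filter(DailyRate > 1600) %>%
  arrange(desc(DailyRate))

unusual
```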

#### Handling Outliers and Missing Values
When you find an outlier, the first step is to investigate it. Is it a mistake? If so, you might want to replace it with a missing value (`NA`) using `ifelse()` or `case_when()`.

Let's imagine we know that any `DailyRate` below 100 or above 1600 is a data error.

```{r}
attrition_data_cleaned <- attrition_data_outlier %>%
  mutate(DailyRate = ifelse(DailyRate < 100 | DailyRate > 1600, NA, DailyRate))

ggplot(data = attrition_data_cleaned) +
  geom_histogram(mapping = aes(x = DailyRate), binwidth = 100)
```
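
The same replacement can be written with `case_when()`, which tends to stay readable when there are several rules. This is a sketch on a made-up `rates` tibble (hypothetical values); `NA_real_` is used so that the missing value has the same numeric type as `DailyRate`.

```{r}
library(tidyverse)

# Hypothetical daily rates, invented for illustration
rates <- tibble(DailyRate = c(50, 450, 1120, 99999, 1373))

rates_cleaned <- rates %>%
  mutate(
    DailyRate = case_when(
      DailyRate < 100  ~ NA_real_,   # implausibly low: treat as missing
      DailyRate > 1600 ~ NA_real_,   # implausibly high: treat as missing
      TRUE             ~ DailyRate   # everything else is kept unchanged
    )
  )

rates_cleaned
```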

Notice how `ggplot2` automatically gives us a warning that it removed the rows with missing values. To suppress that warning, add `na.rm = TRUE` to `geom_histogram()`.
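
Before silencing the warning, it is worth confirming how many values are actually missing. Here is a minimal sketch on a made-up `rates` tibble (hypothetical values): count the `NA`s first, then pass `na.rm = TRUE` to `geom_histogram()`.

```{r}
library(tidyverse)

# Hypothetical daily rates with two missing values, invented for illustration
rates <- tibble(DailyRate = c(450, NA, 1120, 800, NA, 1373))

# How many values will be dropped from the plot?
n_missing <- sum(is.na(rates$DailyRate))
n_missing

# na.rm = TRUE drops the missing values silently instead of warning
p <- ggplot(data = rates) +
  geom_histogram(mapping = aes(x = DailyRate), binwidth = 100, na.rm = TRUE)
p
```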
