---
title: "Introduction to EDA, Variation, and Visualizing Distributions"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

For many of you, this is your first time doing any serious coding, so we'll start with the fundamentals. This lecture covers what exploratory data analysis (EDA) is, why it's a crucial first step in any data analysis project, and how we can use visualizations to understand the individual characteristics of our variables.

## What is Exploratory Data Analysis?
EDA isn't about running complex models or finding final answers. Think of it as _data detective work_. You're generating questions, looking for clues in your data through visualizations and transformations, and then using what you find to ask even better questions. It's an iterative, creative process.

The main goal of EDA is to _develop an understanding of your data_. You're building a mental map of what's in your dataset. Are there missing values? Are there strange outliers? What's the typical range of values for each variable? You're using your curiosity and skepticism to guide this investigation.

To do this, we'll focus on two key types of questions:

1. What type of variation occurs within my variables? (The topic for today)

2. What type of covariation occurs between my variables? (The topic for next time)

Let's define a few key terms to keep us on the same page.

 - A __variable__ is a quantity or quality you can measure.

 - A __value__ is a specific measurement of a variable.

 - An __observation__ is a set of measurements made under similar conditions. It's often a single row in your dataset.

 - __Tabular data__ is a dataset arranged in rows and columns. We'll be working with tidy data, where each variable is a column, each observation is a row, and each value has its own cell. This is the format that works best with the tidyverse tools we'll be using.

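To make these terms concrete, here is a minimal sketch using a tiny made-up table. The column names and values are invented for illustration and are not from the lecture dataset. Each column is a variable, each row is an observation, and each cell holds one value.

```{r}
library(tidyverse)

# Hypothetical toy data: three employees (observations), three variables
toy_employees <- tribble(
  ~employee_id, ~department, ~monthly_income,
  1,            "Sales",     4200,
  2,            "Research",  5100,
  3,            "Sales",     3900
)

ncol(toy_employees)               # number of variables (columns)
nrow(toy_employees)               # number of observations (rows)
toy_employees$monthly_income[2]   # one value: the income recorded for the second observation
```
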
## Understanding Variation in Your Data
Variation is the tendency for a variable's values to change from one observation to the next. Every variable has a unique pattern of variation, and the best way to understand this pattern is to visualize the distribution of its values.

We'll use a new dataset today that's more relevant to business: a fictional employee attrition dataset. Let's load the tidyverse and the data.

```{r}
library(tidyverse)
attrition_data <- read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv")
# Convert character (chr) columns to factors
attrition_data <- attrition_data %>% mutate(across(where(is.character), as.factor))
```
### Visualizing Distributions of Different Variable Types

The plot you use depends on the variable type.

#### Categorical Variables
A categorical variable can only take on a small set of values. In R, these are often `character` vectors or `factors`.

To visualize the distribution of a categorical variable, we use a __bar chart__ (`geom_bar`). The height of each bar represents the count of observations for that category. Let's look at the `JobRole` variable in our dataset.

```{r}
ggplot(data = attrition_data) +
  geom_bar(mapping = aes(x = JobRole))
```

This plot is a bit cluttered because of the long names. Let's make it more readable by flipping the coordinates.

```{r}
ggplot(data = attrition_data) +
  geom_bar(mapping = aes(x = JobRole)) +
  coord_flip()
```

You can also compute these counts directly with `dplyr::count()`:

```{r}
attrition_data %>%
  count(JobRole)
```
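A sorted table with proportions is often easier to read than raw counts. The sketch below uses a small made-up column of roles (not the real data) so that it runs on its own. With the lecture data you would start from `attrition_data` and count `JobRole` instead.

```{r}
library(tidyverse)

# Hypothetical toy data standing in for a categorical column
toy_roles <- tibble(
  role = c("Analyst", "Manager", "Analyst", "Engineer", "Analyst", "Engineer")
)

role_counts <- toy_roles %>%
  count(role, sort = TRUE) %>%   # sort = TRUE puts the most common category first
  mutate(prop = n / sum(n))      # share of observations in each category

role_counts
```
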

#### Continuous Variables
A continuous variable can take on any value within an interval. Think of things like salary, age, or employee tenure.

To visualize the distribution of a continuous variable, we use a __histogram__ (`geom_histogram`). A histogram divides the data range into bins and then counts how many observations fall into each bin. Let's look at the `MonthlyIncome` of our employees.

```{r}
ggplot(data = attrition_data) +
  geom_histogram(mapping = aes(x = MonthlyIncome), binwidth = 1000)
```

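The `binwidth` argument is crucial: it sets the width of each bin in the units of the x variable (here, 1000 units of `MonthlyIncome`). Wide bins can hide structure and narrow bins can be noisy, so it's always a good idea to try a few different `binwidth` values.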

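One way to compare bin widths is to draw the same histogram several times. This sketch uses simulated incomes (made-up numbers from `rlnorm()`, not the lecture data) and a small helper, `income_hist()`, which is a hypothetical name introduced only for this example.

```{r}
library(tidyverse)

# Simulated, made-up incomes so the example runs on its own
set.seed(42)
toy_income <- tibble(income = round(rlnorm(500, meanlog = 8.5, sdlog = 0.5)))

# Hypothetical helper: same histogram, different bin width
income_hist <- function(width) {
  ggplot(data = toy_income) +
    geom_histogram(mapping = aes(x = income), binwidth = width) +
    labs(title = paste("binwidth =", width))
}

income_hist(250)    # narrow bins: more detail, more noise
income_hist(1000)   # medium bins
income_hist(5000)   # wide bins: smoother, but fine structure is hidden
```

If the overall shape changes a lot between widths, that is a sign to look more closely before settling on one.
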
## Interpreting Visualizations: Typical & Unusual Values
Once you have your visualizations, the real detective work begins. We're looking for patterns, common values, and anything that stands out.

### Finding Typical Values
 - __Tall bars in a bar chart__ or a __high frequency in a histogram__ show us __the most common values__. For example, in our `JobRole` bar chart, `Sales Executive` and `Research Scientist` appear to be the most common roles.

 - __Clusters__ of values in a histogram can suggest there are underlying __subgroups__ in your data. You might have a group of new hires with low `TotalWorkingYears` and a group of senior employees with many years of experience. We'll explore this more next week.

### Identifying Unusual Values (Outliers)
__Outliers__ are observations that don't fit the general pattern. They could be data entry errors (like a salary of $0) or they could be real, interesting observations (a CEO with a massive salary).

Outliers can be hard to see in histograms, especially if there are a lot of data points. Let's look at the `DailyRate` variable.

```{r}
ggplot(data = attrition_data) +
  geom_histogram(mapping = aes(x = DailyRate))
```

The plot looks okay. But what if there's a typo, say an employee with a `DailyRate` of 99999? Let's add that to the data and see what happens.

```{r}
attrition_data_outlier <- attrition_data %>%
  add_row(DailyRate = 99999)

ggplot(data = attrition_data_outlier) +
  geom_histogram(mapping = aes(x = DailyRate), binwidth = 100)
```

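The one outlier makes the rest of the plot unreadable! It is so far out that the x-axis stretches to nearly 100,000, squeezing the bars for the typical values into a thin sliver on the left, while the outlier's own bar (a count of 1) is too short to see.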

This is where `coord_cartesian()` comes in handy. It lets us zoom in without throwing out the data.

```{r}
ggplot(data = attrition_data_outlier) +
  geom_histogram(mapping = aes(x = DailyRate), binwidth = 100) +
  coord_cartesian(xlim = c(0, 1600))  # start at 0 so low but valid rates stay in view
```

This reveals the typical distribution again. The outlier has not been removed; it is still in the data, just outside the visible window.

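The difference between zooming and filtering is easy to miss. As a sketch on made-up data: `coord_cartesian()` only changes the visible window, whereas setting limits with `xlim()` drops out-of-range values before the histogram is computed, and ggplot2 warns about the removed rows.

```{r}
library(tidyverse)

# Hypothetical toy data: typical values plus one extreme value
toy_rates <- tibble(daily_rate = c(seq(150, 1450, by = 50), 99999))

base_plot <- ggplot(data = toy_rates) +
  geom_histogram(mapping = aes(x = daily_rate), binwidth = 100)

# Zoom: every row is still binned; the outlier is simply out of view
zoomed <- base_plot + coord_cartesian(xlim = c(0, 1600))

# Scale limits: values outside 0-1600 are dropped before binning (expect a warning)
trimmed <- base_plot + xlim(0, 1600)

zoomed
trimmed
```

For exploration, zooming is usually the safer default because nothing is silently discarded.
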
#### Handling Outliers and Missing Values
When you find an outlier, the first step is to investigate it. Is it a mistake? If so, you might want to replace it with a missing value (NA) using `ifelse()` or `case_when()`.

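A quick way to start that investigation is to pull the suspicious rows out and read them. The sketch below uses a tiny made-up table with invented column names, and the cutoff of 1600 is an assumption chosen only for illustration.

```{r}
library(tidyverse)

# Hypothetical toy data: one value is clearly out of line with the rest
toy_rates <- tibble(
  employee_id = 1:6,
  daily_rate  = c(450, 1200, 800, 99999, 300, 1100)
)

# Keep only the rows above the assumed plausible maximum, largest first
unusual <- toy_rates %>%
  filter(daily_rate > 1600) %>%
  arrange(desc(daily_rate))

unusual
```
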
Let's imagine we know that any `DailyRate` below 100 or above 1600 is a data error.

```{r}
attrition_data_cleaned <- attrition_data_outlier %>%
  mutate(DailyRate = ifelse(DailyRate < 100 | DailyRate > 1600, NA, DailyRate))

ggplot(data = attrition_data_cleaned) +
  geom_histogram(mapping = aes(x = DailyRate))
```
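After recoding, it is worth confirming how many values actually became `NA`. This is a minimal sketch on made-up data, using the same rule (below 100 or above 1600). `if_else()` with `NA_real_` is shown here as a stricter alternative to `ifelse()`, not as the method used in the lecture code.

```{r}
library(tidyverse)

# Hypothetical toy data with one value too low and one too high
toy_rates <- tibble(daily_rate = c(50, 450, 1200, 800, 99999, 300))

toy_cleaned <- toy_rates %>%
  mutate(
    # NA_real_ keeps the column numeric; if_else() is stricter about types than ifelse()
    daily_rate = if_else(daily_rate < 100 | daily_rate > 1600, NA_real_, daily_rate)
  )

# Count how many values were replaced with NA
toy_cleaned %>%
  summarise(n_missing = sum(is.na(daily_rate)))
```
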

Notice that `ggplot2` drops the missing values before plotting and warns us that it did so. To suppress that warning, add `na.rm = TRUE` to `geom_histogram()`.

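As a sketch on a small made-up column that contains a missing value (not the lecture data), passing `na.rm = TRUE` to the geom drops the `NA` silently:

```{r}
library(tidyverse)

# Hypothetical toy data with one missing value
toy_rates <- tibble(daily_rate = c(450, 1200, 800, NA, 300))

# na.rm = TRUE tells geom_histogram() to drop the NA without a warning
quiet_plot <- ggplot(data = toy_rates) +
  geom_histogram(mapping = aes(x = daily_rate), binwidth = 100, na.rm = TRUE)

quiet_plot
```

Only silence the warning once you know why the values are missing; otherwise it is a useful reminder.
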