---
title: "Covariation and Advanced Visualization Techniques"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

[Last time](https://rpubs.com/uky994/1343682), we focused on understanding variation within a single variable. This time, we turn to covariation: the relationship between two or more variables. This is where we start to ask more complex questions, like "Do employees with higher salaries tend to have more years of experience?"

```{r}
library(tidyverse)
attrition_data <- read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv")
# Convert character columns to factors
attrition_data <- attrition_data %>% mutate(across(where(is.character), as.factor))
```

## Understanding Covariation
Covariation is the tendency for the values of two or more variables to vary together in a related way. It sits at the core of most statistical analysis. When two variables covary, knowing one helps you predict the other. For example, if you know an employee's `MonthlyIncome`, you have a better idea of their `JobLevel`.
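
One rough way to put a number on covariation between two numeric variables is a correlation coefficient. The sketch below uses base R's `cor()` on a small made-up data frame; `toy_pairs` and its values are hypothetical and are not taken from the attrition data.

```{r}
# Toy data: hypothetical values for illustration only
toy_pairs <- data.frame(
  years  = c(1, 3, 5, 7, 9),
  income = c(2000, 3100, 3900, 5200, 6000)
)

# Pearson correlation: close to +1 when the two variables rise together linearly
r_pearson <- cor(toy_pairs$years, toy_pairs$income)

# Spearman (rank) correlation: only asks whether the orderings agree
r_spearman <- cor(toy_pairs$years, toy_pairs$income, method = "spearman")

r_pearson
r_spearman
```

With the notebook's data loaded, the same call could be applied to two numeric columns, for example `cor(attrition_data$TotalWorkingYears, attrition_data$MonthlyIncome)`. A single number can hide a lot, which is why the rest of this notebook leans on plots.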

## Visualizing Covariation: Continuous & Categorical Variables
A very common task is to explore how a continuous variable's distribution changes across different categories.

### Boxplots (`geom_boxplot`)
While frequency polygons (`geom_freqpoly`) are good for comparing densities, boxplots are a fantastic visual shorthand for comparing continuous distributions across categories. They are compact and highlight the median, the interquartile range (IQR), and outliers (by default, points more than 1.5 times the IQR beyond the edges of the box).

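Let's compare `MonthlyIncome` across the levels of `JobLevel`. Because `JobLevel` is stored as a number, we wrap it in `factor()` so that each level gets its own box.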

```{r}
ggplot(data = attrition_data, mapping = aes(x = factor(JobLevel), y = MonthlyIncome)) +
  geom_boxplot()
```

The plot shows a clear positive relationship: as `JobLevel` increases, the median `MonthlyIncome` also increases.
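
The numbers a boxplot draws can also be computed directly, which helps when you want to check what the plot is showing. The following is a minimal sketch on made-up data; `toy_income` and its values are hypothetical, and the 1.5 times IQR rule is the default used by `geom_boxplot()`.

```{r}
library(dplyr)

# Toy data: hypothetical incomes for two job levels (illustration only)
toy_income <- tibble(
  level  = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  income = c(2000, 2200, 2400, 2600, 9000, 5000, 5200, 5400, 5600, 5800)
)

box_summary <- toy_income %>%
  group_by(level) %>%
  summarise(
    q1     = quantile(income, 0.25, names = FALSE),
    median = median(income),
    q3     = quantile(income, 0.75, names = FALSE),
    iqr    = IQR(income),
    # Points more than 1.5 * IQR beyond the box are drawn individually as outliers
    n_outliers = sum(income < q1 - 1.5 * iqr | income > q3 + 1.5 * iqr)
  )

box_summary
```

In this toy example the single income of 9000 at level 1 falls outside the upper fence, so it would appear as an isolated point above the box.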

#### Reordering and Flipping
Sometimes your categorical variable has no inherent order, and the default alphabetical order isn't the most informative. We can use `reorder()` to arrange the categories based on a summary statistic of the continuous variable, like the median. Let's compare `MonthlyIncome` across the categories of `EducationField`.

```{r}
ggplot(data = attrition_data, mapping = aes(x = reorder(EducationField, MonthlyIncome, FUN = median), y = MonthlyIncome)) +
  geom_boxplot() +
  coord_flip()
```

Flipping the coordinates with `coord_flip()` helps when the category names are long, because horizontal labels are easier to read.
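
To see what `reorder()` does on its own, it can help to look at the factor levels before and after. This is a small sketch with made-up values; `toy_field` and `toy_pay` are hypothetical names and are not part of the attrition data.

```{r}
# Toy data: hypothetical fields and incomes (illustration only)
toy_field <- factor(c("Arts", "Arts", "Biology", "Biology", "Chemistry", "Chemistry"))
toy_pay   <- c(5000, 7000, 2000, 3000, 4000, 4500)

# Default level order is alphabetical
levels(toy_field)
# → "Arts" "Biology" "Chemistry"

# reorder() keeps the values but sorts the levels by median pay (low to high)
by_median <- reorder(toy_field, toy_pay, FUN = median)
levels(by_median)
# → "Biology" "Chemistry" "Arts"

# Negating the values sorts from highest to lowest median instead
by_median_desc <- reorder(toy_field, -toy_pay, FUN = median)
levels(by_median_desc)
# → "Arts" "Chemistry" "Biology"
```

Only the level order changes; the observations themselves are untouched, so the boxes are the same and just appear in a different sequence.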

## Visualizing Covariation: Two Categorical Variables
To visualize the relationship between two categorical variables, we need to count how many observations fall into each combination of categories.

### `geom_count()`
This is the simplest way. `geom_count()` counts the observations at each combination of categories and maps the count to the size of a circle. Let's look at the relationship between `JobRole` and `Department`.

```{r}
ggplot(data = attrition_data) +
  geom_count(mapping = aes(x = Department, y = JobRole))
```

### Manual Counting with `dplyr` and `geom_tile()`
For more control, we can count the combinations ourselves with `dplyr` and then use a heat map (`geom_tile()`) to visualize the results. Having the counts in a data frame makes it easy to filter or transform them before plotting.

```{r}
attrition_data %>%
  count(Department, JobRole) %>%
  ggplot(mapping = aes(x = Department, y = JobRole)) +
  geom_tile(mapping = aes(fill = n))
```

With ggplot2's default fill scale, lighter tiles mean more employees in that `Department`/`JobRole` combination. This tells us, for example, that the `Research & Development` department has a large number of `Research Scientist` and `Laboratory Technician` roles, which makes perfect sense.
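
To make the color direction explicit rather than rely on the default scale, one option is to set the gradient yourself. The sketch below uses made-up data (`toy_staff` and its values are hypothetical) and also shows what `count()` returns before it reaches the plot.

```{r}
library(dplyr)
library(ggplot2)

# Toy data: hypothetical department/role pairs (illustration only)
toy_staff <- tibble(
  dept = c("Sales", "Sales", "Sales", "R&D", "R&D", "HR"),
  role = c("Rep", "Rep", "Manager", "Scientist", "Scientist", "Manager")
)

# count() returns one row per observed combination, with the tally in column n
combos <- toy_staff %>% count(dept, role)
combos

# Explicit gradient so that higher counts are drawn in a darker color
tile_plot <- ggplot(combos, aes(x = dept, y = role)) +
  geom_tile(aes(fill = n)) +
  scale_fill_gradient(low = "white", high = "steelblue4")

tile_plot
```

Combinations that never occur get no row in the counted table and therefore no tile, which is why heat maps of this kind often have gaps.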

## Visualizing Covariation: Two Continuous Variables
The classic way to visualize the relationship between two continuous variables is with a scatterplot using `geom_point()`. Let's explore the relationship between `MonthlyIncome` and `TotalWorkingYears`.

```{r}
ggplot(data = attrition_data) +
  geom_point(mapping = aes(x = TotalWorkingYears, y = MonthlyIncome))
```

### The Overplotting Problem
For large datasets, scatterplots can suffer from overplotting, where many points overlap and create a big black blob, hiding the true distribution.

Let's pretend our dataset is much larger. Solutions include:

 - __Transparency (`alpha`)__: Make the points semi-transparent so you can see where they are most dense.

```{r}
ggplot(data = attrition_data) +
  geom_point(mapping = aes(x = TotalWorkingYears, y = MonthlyIncome), alpha = 1/5)
```

 - __2D Binning (`geom_bin2d` or `geom_hex`)__: Divide the plot into 2D bins and use color to represent the number of points in each bin. This is like a 2D histogram.

```{r}
ggplot(data = attrition_data) +
  geom_bin2d(mapping = aes(x = TotalWorkingYears, y = MonthlyIncome))
```

This reveals that most employees are clustered at lower `TotalWorkingYears` and lower `MonthlyIncome`.
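
The "2D histogram" idea can be sketched by hand: cut each axis into intervals and count the points that land in each cell. This is only an illustration of the principle on made-up data (`toy_points` and the break points are hypothetical); `geom_bin2d()` chooses its own bins.

```{r}
library(dplyr)

# Toy data: hypothetical points (illustration only)
toy_points <- tibble(
  years  = c(1, 2, 2, 3, 8, 9, 15, 18),
  income = c(2000, 2500, 2200, 3000, 2800, 9000, 12000, 15000)
)

# Cut each axis into intervals, then count the points in each 2D cell
binned <- toy_points %>%
  mutate(
    years_bin  = cut(years,  breaks = c(0, 10, 20)),
    income_bin = cut(income, breaks = c(0, 8000, 16000), dig.lab = 5)
  ) %>%
  count(years_bin, income_bin)

binned
```

Each row of `binned` corresponds to one colored tile in a binned plot, and `n` is the value mapped to the fill color. Note that `geom_hex()` draws hexagonal bins instead and needs the `hexbin` package to be installed.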

## Patterns and Models
The patterns we see in our data give us clues. A strong relationship between `TotalWorkingYears` and `MonthlyIncome` makes sense: we'd expect more experienced employees to earn more.

Sometimes, a dominant relationship can hide other, more subtle ones. This is where models can be useful during EDA. We can use a model to "remove" the effect of a variable and then look at the residuals (what's left over).

Let's model the relationship between `MonthlyIncome` and `JobLevel`.

```{r}
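library(modelr)

# Fit a linear model of income on job level (JobLevel is treated as numeric here)
mod <- lm(MonthlyIncome ~ JobLevel, data = attrition_data)

# add_residuals() appends a `resid` column: the income not explained by JobLevel
attrition_data_cleaned <- attrition_data %>%
  add_residuals(mod)

# Compare the leftover income across job roles
ggplot(data = attrition_data_cleaned) +
  geom_boxplot(mapping = aes(x = JobRole, y = resid)) +
  coord_flip()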
```

By plotting the residuals against `JobRole`, we are looking at the salary differences that remain after accounting for `JobLevel`. This can suggest which job roles are paid more or less than the model predicts for their `JobLevel`.
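
A residual is simply the observed value minus the model's prediction. The base-R sketch below shows that calculation on made-up data; `toy_levels` and `toy_mod` are hypothetical names, and the assumption is that this matches, in spirit, what the `resid` column holds.

```{r}
# Toy data: hypothetical levels and incomes (illustration only)
toy_levels <- data.frame(
  level  = c(1, 1, 2, 2, 3, 3),
  income = c(2000, 3000, 5000, 5500, 8000, 9500)
)

toy_mod <- lm(income ~ level, data = toy_levels)

# Residual = observed income minus the income predicted from level
toy_levels$pred  <- predict(toy_mod, newdata = toy_levels)
toy_levels$resid <- toy_levels$income - toy_levels$pred

toy_levels

# For a least-squares fit with an intercept, the residuals sum to (about) zero
sum(toy_levels$resid)
```

A positive residual means the observation sits above what the model expects for its level, and a negative one means it sits below. Because the model treats the level as a number, it assumes the same income step between consecutive levels; fitting `income ~ factor(level)` would be one way to relax that assumption.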

## `ggplot2` Conciseness  
As you become more comfortable, you can write more concise code by omitting argument names like `data =` and `mapping =`.

Example:
Instead of `ggplot(data = attrition_data, mapping = aes(x = JobRole)) + geom_bar()`, you can write `ggplot(attrition_data, aes(x = JobRole)) + geom_bar()`.

You'll also see a lot of pipes (`%>%`) leading into a `ggplot2` call.

```{r}
attrition_data %>%
  group_by(Education) %>%
  summarise(MedianIncome = median(MonthlyIncome)) %>%
  ggplot(aes(x = reorder(Education, MedianIncome), y = MedianIncome)) +
  geom_col()
```

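This is a very common workflow in R, and it's a convenient way to move from data manipulation to visualization. Note that the steps before `ggplot()` are joined with `%>%`, while the plot layers are still added with `+`.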












