---
title: "Covariation and Advanced Visualization Techniques"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

[Last time](https://rpubs.com/uky994/1343682), we focused on understanding variation within a single variable. This time, we turn to covariation: the relationship between two or more variables. This is where we start to ask more complex questions, like "Do employees with higher salaries tend to have more years of experience?"

```{r}
library(tidyverse)
attrition_data <- read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv")
# Convert character columns to factors
attrition_data <- attrition_data %>% mutate(across(where(is.character), as.factor))
```

## Understanding Covariation
Covariation is the tendency for the values of two or more variables to vary together in a related way. It's the core of most statistical analysis. When two variables covary, knowing the value of one helps you predict the other. For example, if you know an employee's `MonthlyIncome`, you have a better idea of their `JobLevel`.
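
One way to put a number on covariation is a correlation coefficient. The sketch below uses base R's `cor()` on a small made-up data frame. `toy_levels` and its values are illustrative and not taken from `attrition_data`, though the column names mirror the real dataset so the same calls should carry over.

```{r}
# Made-up values for illustration; column names mirror attrition_data
toy_levels <- data.frame(
  JobLevel      = c(1, 1, 2, 2, 3, 4, 5),
  MonthlyIncome = c(2500, 3100, 5200, 4800, 9800, 15500, 19000)
)

# Pearson correlation measures linear covariation on a scale from -1 to 1
r_pearson <- cor(toy_levels$JobLevel, toy_levels$MonthlyIncome)

# Spearman correlation works on ranks, so it only assumes a monotonic relationship
r_spearman <- cor(toy_levels$JobLevel, toy_levels$MonthlyIncome, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

Values close to 1 or -1 indicate strong covariation, and values near 0 indicate little linear (or monotonic) relationship.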

## Visualizing Covariation: Continuous & Categorical Variables
A very common task is to explore how a continuous variable's distribution changes across different categories.

### Boxplots (`geom_boxplot`)
While frequency polygons (`geom_freqpoly`) are good for comparing densities, boxplots are a fantastic visual shorthand for comparing continuous distributions across categories. They are compact and highlight the median, the interquartile range (IQR), and outliers.
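
To make the pieces of a boxplot concrete, here is a rough sketch of the numbers behind one box, computed with base R on a made-up income vector. It assumes the usual convention, which `geom_boxplot()` follows by default, that points more than 1.5 times the IQR beyond the quartiles are drawn as outliers.

```{r}
# Made-up incomes for illustration, not taken from attrition_data
income <- c(2100, 2500, 2900, 3300, 4100, 4800, 5200, 6100, 19500)

# The box: first quartile, median, third quartile
q <- quantile(income, c(0.25, 0.5, 0.75))
iqr <- IQR(income)

# Points beyond these fences are plotted individually as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- income[income < lower_fence | income > upper_fence]

q
outliers
```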

Let's compare `MonthlyIncome` across the levels of `JobLevel`.

```{r}
ggplot(data = attrition_data, mapping = aes(x = factor(JobLevel), y = MonthlyIncome)) +
  geom_boxplot()
```

The plot shows a clear positive relationship: as `JobLevel` increases, the median `MonthlyIncome` also increases.
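
A visual impression like this can be checked numerically by computing the medians directly. The sketch below does so with `dplyr` on a made-up tibble (`toy_income` is hypothetical). Replacing it with `attrition_data` should give the medians that the boxplot draws as its middle lines.

```{r}
library(dplyr)

# Made-up values for illustration; column names mirror attrition_data
toy_income <- tibble(
  JobLevel      = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  MonthlyIncome = c(2400, 2900, 3500, 4800, 5600, 6100, 9200, 10100, 11500)
)

# One row per JobLevel with its median income
median_by_level <- toy_income %>%
  group_by(JobLevel) %>%
  summarise(MedianIncome = median(MonthlyIncome))

median_by_level
```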

#### Reordering and Flipping
Sometimes your categorical variable is unordered, and the default alphabetical order isn't the most informative. We can use `reorder()` to arrange the categories by a summary statistic of the continuous variable, such as the median. Let's compare `MonthlyIncome` across the categories of `EducationField`.

```{r}
ggplot(data = attrition_data, mapping = aes(x = reorder(EducationField, MonthlyIncome, FUN = median), y = MonthlyIncome)) +
  geom_boxplot() +
  coord_flip()
```

Flipping the coordinates with `coord_flip()` puts the categories on the vertical axis, which keeps long category names readable.
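
To see what `reorder()` does to the underlying factor, the following sketch compares the level order before and after on a tiny made-up data frame (`toy_fields` and its values are illustrative only).

```{r}
# Made-up values for illustration; column names mirror attrition_data
toy_fields <- data.frame(
  EducationField = c("Medical", "Medical", "Marketing", "Marketing", "Other", "Other"),
  MonthlyIncome  = c(6000, 7000, 9000, 8000, 3000, 4000)
)

# Default factor levels are alphabetical
default_levels <- levels(factor(toy_fields$EducationField))

# reorder() sorts the levels by the median income within each category
reordered <- reorder(toy_fields$EducationField, toy_fields$MonthlyIncome, FUN = median)
median_levels <- levels(reordered)

default_levels
median_levels
```

Because ggplot2 draws discrete axes in level order, changing the levels is all that is needed to change the order of the boxes.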

## Visualizing Covariation: Two Categorical Variables
To visualize the relationship between two categorical variables, we need to count how many observations fall into each combination of categories.

### `geom_count()`
This is the simplest way. `geom_count()` automatically counts the observations in each combination of categories and maps each count to the size of a circle. Let's look at the relationship between `JobRole` and `Department`.

```{r}
ggplot(data = attrition_data) +
  geom_count(mapping = aes(x = Department, y = JobRole))
```

### Manual Counting with `dplyr` and `geom_tile()`
For more control, we can manually count the combinations using `dplyr` and then use a heat map (`geom_tile()`) to visualize the results. This is often more flexible.

```{r}
attrition_data %>%
  count(Department, JobRole) %>%
  ggplot(mapping = aes(x = Department, y = JobRole)) +
  geom_tile(mapping = aes(fill = n))
```

With ggplot2's default continuous fill scale, which runs from dark blue (low) to light blue (high), lighter tiles mean more employees in that `Department`/`JobRole` combination; the legend confirms the direction. The plot shows, for example, that the `Research & Development` department has a large number of `Research Scientist` and `Laboratory Technician` roles, as we would expect.
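
The direction of the color ramp depends on the fill scale, so it can help to set it explicitly. Here is a minimal sketch, assuming a small made-up tibble (`toy_roles`) in place of `attrition_data`. It passes a light `low` color and a dark `high` color to `scale_fill_gradient()` so that darker tiles always mean higher counts.

```{r}
library(dplyr)
library(ggplot2)

# Made-up rows for illustration; column names mirror attrition_data
toy_roles <- tibble(
  Department = c("Sales", "Sales", "Sales", "R&D", "R&D", "HR"),
  JobRole    = c("Sales Executive", "Sales Executive", "Manager",
                 "Research Scientist", "Research Scientist", "Manager")
)

# One row per observed Department/JobRole combination, with its count in n
role_counts <- toy_roles %>%
  count(Department, JobRole)

# Explicit gradient: low counts are white, high counts are dark blue
ggplot(role_counts, aes(x = Department, y = JobRole)) +
  geom_tile(aes(fill = n)) +
  scale_fill_gradient(low = "white", high = "darkblue")
```

Combinations that never occur have no row in `role_counts`, so they appear as empty cells rather than as zero-count tiles.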

## Visualizing Covariation: Two Continuous Variables
The classic way to visualize the relationship between two continuous variables is with a scatterplot using `geom_point()`. Let's explore the relationship between `MonthlyIncome` and `TotalWorkingYears`.

```{r}
ggplot(data = attrition_data) +
  geom_point(mapping = aes(x = TotalWorkingYears, y = MonthlyIncome))
```

### The Overplotting Problem
For large datasets, scatterplots can suffer from overplotting, where many points overlap and create a big black blob, hiding the true distribution.

Let's pretend our dataset is much larger. Solutions include:

 - __Transparency (`alpha`)__: Make the points semi-transparent so you can see where they are most dense.

```{r}
ggplot(data = attrition_data) +
  geom_point(mapping = aes(x = TotalWorkingYears, y = MonthlyIncome), alpha = 1/5)
```

 - __2D Binning (`geom_bin2d` or `geom_hex`)__: Divide the plot into 2D bins and use color to represent the number of points in each bin. This is like a 2D histogram.

```{r}
ggplot(data = attrition_data) +
  geom_bin2d(mapping = aes(x = TotalWorkingYears, y = MonthlyIncome))
```

This reveals that most employees are clustered at lower `TotalWorkingYears` and lower `MonthlyIncome`.
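
The "2D histogram" idea can also be reproduced as a table, which shows what the binning geoms compute before they draw anything. This is a rough sketch on made-up points (`toy_points` is hypothetical). It uses ggplot2's `cut_width()` helper to bin each variable, and the bin widths of 10 years and 5000 income units are arbitrary choices for illustration.

```{r}
library(dplyr)
library(ggplot2)  # for cut_width()

# Made-up values for illustration; column names mirror attrition_data
toy_points <- tibble(
  TotalWorkingYears = c(1, 2, 3, 4, 6, 8, 12, 25),
  MonthlyIncome     = c(2000, 2500, 3000, 2800, 4000, 5200, 9000, 18000)
)

# Bin both variables, then count the points in each occupied 2D bin
bin_counts <- toy_points %>%
  count(
    years_bin  = cut_width(TotalWorkingYears, width = 10, boundary = 0),
    income_bin = cut_width(MonthlyIncome, width = 5000, boundary = 0)
  )

bin_counts
```

Each row corresponds to one filled rectangle in a `geom_bin2d()` plot, and `n` is the value mapped to the fill color.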

## Patterns and Models
The patterns we see in our data give us clues. A strong relationship between `TotalWorkingYears` and `MonthlyIncome` makes sense—we'd expect more experienced employees to earn more.

Sometimes, a dominant relationship can hide other, more subtle ones. This is where models can be useful during EDA. We can use a model to "remove" the effect of a variable and then look at the residuals (what's left over).

Let's model `MonthlyIncome` as a linear function of `JobLevel`.

```{r}
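library(modelr)  # provides add_residuals(); not attached by library(tidyverse)

# Fit a linear model that predicts income from job level (treated as numeric)
mod <- lm(MonthlyIncome ~ JobLevel, data = attrition_data)

# Add a `resid` column: the part of income the model does not explain
attrition_data_cleaned <- attrition_data %>%
  add_residuals(mod)

# Compare the leftover income across job roles
ggplot(data = attrition_data_cleaned) +
  geom_boxplot(mapping = aes(x = JobRole, y = resid)) +
  coord_flip()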
```

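By plotting the residuals against `JobRole`, we are now looking at relative salary differences after accounting for `JobLevel`. A positive residual means the employee earns more than the model predicts for their level, and a negative residual means less. This might reveal which job roles are over- or underpaid relative to their `JobLevel`.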
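
A residual is simply the observed value minus the model's prediction. The base R sketch below makes that explicit on made-up data (`toy_pay` and `mod_toy` are hypothetical names) and checks the manual calculation against `residuals()`. It assumes `add_residuals()` computes the same quantity for a linear model.

```{r}
# Made-up values for illustration; column names mirror attrition_data
toy_pay <- data.frame(
  JobLevel      = c(1, 1, 2, 2, 3, 3),
  MonthlyIncome = c(2800, 3200, 5500, 6500, 9000, 11000)
)

mod_toy <- lm(MonthlyIncome ~ JobLevel, data = toy_pay)

# Residual = observed income - income predicted from JobLevel
manual_resid <- toy_pay$MonthlyIncome - predict(mod_toy, newdata = toy_pay)

# Should agree with the residuals stored in the fitted model
all.equal(unname(manual_resid), unname(residuals(mod_toy)))
```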

## `ggplot2` Conciseness
As you become more comfortable, you can write more concise code by omitting argument names like `data =` and `mapping =`.

Example:
Instead of `ggplot(data = attrition_data, mapping = aes(x = JobRole)) + geom_bar()`, you can write `ggplot(attrition_data, aes(x = JobRole)) + geom_bar()`.

You'll also see a lot of pipes (`%>%`) leading into a ggplot2 call.

```{r}
attrition_data %>%
  group_by(Education) %>%
  summarise(MedianIncome = median(MonthlyIncome)) %>%
  ggplot(aes(x = reorder(Education, MedianIncome), y = MedianIncome)) +
  geom_col()  # equivalent to geom_bar(stat = "identity")
```

This is a very common workflow in R: the pipe handles the data manipulation, and the result flows straight into the plot. Note that once `ggplot()` is called, layers are added with `+`, not `%>%`.












