Tidyverse Mastery: Select, Filter & Mutate for Beginners

What will we learn from this lesson?

In data science, the most important task before creating graphs is organizing the data according to your needs. This notebook will teach you:

select(): How to isolate specific columns from a dataset that may have hundreds of them.
filter(): How to keep only the rows that meet specific conditions.
mutate(): How to create new columns through calculations.
group_by(): How to perform calculations separately for each category.

Environment Setup

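We will use the tidyverse, a collection of packages that includes both dplyr (for data manipulation) and ggplot2 (for visualization). Loading it also makes the pipe operator %>% available, which passes the result on its left to the function on its right as its first argument.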

Code

library(tidyverse)

# Previewing the built-in 'iris' dataset
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
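
As a quick check of how the pipe behaves before it appears in every step, the sketch below compares a direct call with its piped form. It loads only dplyr (one of the packages that library(tidyverse) attaches), and the names direct and piped are made up for this illustration.

```r
library(dplyr)  # makes %>% available

# A direct call and its piped form: the pipe feeds `iris` in as the first argument
direct <- head(iris, 3)
piped  <- iris %>% head(3)

identical(direct, piped)  # TRUE
```

Reading a pipeline as "take iris, then do this, then do that" is usually the easiest way to follow the steps in this lesson.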

Step 1: Selecting Columns (Select)

Objective: Isolate only the necessary variables, such as flower species and sepal dimensions.

Code

# Choosing only the necessary columns
iris %>%
  select(Species, Sepal.Length, Sepal.Width) %>% 
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

Explanation: The select() function keeps only the columns you name and drops the rest. The number of rows does not change; the dataset simply becomes narrower.
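
To see what select() returns without drawing a plot, here is a minimal sketch that stores the result and inspects it. The names sepals and sepals_only are illustrative, and the second call uses the starts_with() helper, which is not used elsewhere in this lesson.

```r
library(dplyr)

# Keep three of the five columns
sepals <- iris %>%
  select(Species, Sepal.Length, Sepal.Width)

names(sepals)  # "Species" "Sepal.Length" "Sepal.Width"
dim(sepals)    # 150 rows, 3 columns

# The same selection using a helper that matches columns by name prefix
sepals_only <- iris %>%
  select(Species, starts_with("Sepal"))

identical(sepals, sepals_only)  # TRUE
```

Helpers such as starts_with() become more useful as the number of columns grows, since you no longer have to type each name.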

Step 2: Removing Unnecessary Data (Filter)

Objective: Clean the graph by showing only flowers with a sepal length greater than 5.0 cm.

Code

# Keeping only flowers with a sepal length greater than 5.0 cm
iris %>%
  filter(Sepal.Length > 5.0) %>% 
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

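Explanation: The filter() function keeps only the rows for which your condition is TRUE and drops all others (including rows where the condition evaluates to NA). With fewer points on the plot, the remaining pattern is easier to read.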
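
The effect of filter() can also be checked numerically. The sketch below, with illustrative names long_sepals and long_setosa, confirms that every remaining row satisfies the condition and shows that several conditions separated by commas are combined with AND.

```r
library(dplyr)

# Keep rows where the sepal is longer than 5.0 cm
long_sepals <- iris %>%
  filter(Sepal.Length > 5.0)

nrow(iris)                     # 150 rows before filtering
nrow(long_sepals)              # fewer rows after filtering
min(long_sepals$Sepal.Length)  # smallest remaining value is above 5.0

# Multiple conditions: both must be TRUE for a row to be kept
long_setosa <- iris %>%
  filter(Sepal.Length > 5.0, Species == "setosa")

unique(as.character(long_setosa$Species))  # "setosa"
```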

Step 3: Creating New Information (Mutate)

Objective: Calculate a new variable, “Sepal Area” (Length × Width), and visualize it.

Code

# Creating a new 'Area' column and placing it on the Y-axis
iris %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>% 
  ggplot(aes(x = Sepal.Length, y = Sepal.Area, color = Species)) +
  geom_point()

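Explanation: mutate() adds new columns computed from existing ones and keeps all the original columns. Here the calculation runs row by row, so each flower gets its own Sepal.Area value (in cm², since both inputs are in cm).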
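
A short sketch of what mutate() does to the data itself, using the illustrative name with_area: the original five columns remain, and one new column is appended.

```r
library(dplyr)

with_area <- iris %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width)

ncol(iris)       # 5
ncol(with_area)  # 6: the original columns plus Sepal.Area

# First row: 5.1 * 3.5 = 17.85
with_area$Sepal.Area[1]
```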

Step 4: Grouping by Species (Group_by)

Objective: Compare the average sepal length of each flower species using dashed lines.

Code

# Grouping to find the average per species
iris %>%
  group_by(Species) %>% 
  mutate(Avg_Length = mean(Sepal.Length)) %>% 
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  geom_vline(aes(xintercept = Avg_Length, color = Species), linetype = "dashed")

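Explanation: group_by() splits the data into one group per species, so the mutate() that follows calculates mean(Sepal.Length) within each group rather than across the whole dataset. Every row of a species receives the same Avg_Length value, which is why each species gets a single dashed line at its own average.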
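
To make the difference concrete, the sketch below runs the same mutate() with and without group_by(). The names grouped, per_species, and ungrouped are illustrative, and the ungroup() and distinct() calls are additions for inspection only; they are not part of the plotting code above.

```r
library(dplyr)

# With grouping: the mean is computed once per species
grouped <- iris %>%
  group_by(Species) %>%
  mutate(Avg_Length = mean(Sepal.Length)) %>%
  ungroup()  # drop the grouping once the per-group work is done

# One row per species with its own average
per_species <- grouped %>%
  distinct(Species, Avg_Length)
per_species

# Without grouping: every row gets the single overall mean
ungrouped <- iris %>%
  mutate(Avg_Length = mean(Sepal.Length))

n_distinct(ungrouped$Avg_Length)  # 1
```

Calling ungroup() at the end is a common habit, so that later steps do not run per group by accident.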

Summary: The 3 Golden Rules

Function	Action	Systemic Role
Select	Vertical Trimming	Choosing specific Columns.
Filter	Horizontal Trimming	Choosing specific Rows.
Mutate	Information Creation	Adding new Variables.
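
The three functions are often chained in a single pipeline. The sketch below is one possible combination of the steps from this lesson, not the only sensible order; result is an illustrative name.

```r
library(dplyr)

result <- iris %>%
  select(Species, Sepal.Length, Sepal.Width) %>%    # choose columns
  filter(Sepal.Length > 5.0) %>%                    # choose rows
  mutate(Sepal.Area = Sepal.Length * Sepal.Width)   # add a variable

names(result)  # "Species" "Sepal.Length" "Sepal.Width" "Sepal.Area"
```

The order matters when a later step depends on an earlier one: a column dropped by select() is no longer available to filter() or mutate().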

Congratulations! You now know the core dplyr functions select(), filter(), and mutate(), and how group_by() makes calculations run separately for each category.