The Systemic Guide to Data Filtering

Author

Abdullah Al Shamim

Filtering allows you to “zoom in” on specific rows that meet your research criteria while hiding the rest.

Phase 1: Basic Logical Comparisons

Filtering is built on comparison operators such as > (greater than), < (less than), and == (equal to).

Find observations above a certain threshold.

Code
library(tidyverse)

# Mammals sleeping more than 18 hours
long_sleepers <- msleep %>% 
  select(name, sleep_total) %>% 
  filter(sleep_total > 18)
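To see what filter() is doing, it can help to look at the comparison on its own. The sketch below uses a small made-up tibble (toy, with invented values, not rows from msleep) to show that the condition produces one TRUE or FALSE per row and that filter() keeps the rows where it is TRUE.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not taken from msleep)
toy <- tibble(
  name        = c("A", "B", "C"),
  sleep_total = c(19.5, 12.0, 18.0)
)

# The comparison alone returns one logical value per row
is_long <- toy$sleep_total > 18
is_long
# TRUE FALSE FALSE

# filter() keeps only the rows where the condition is TRUE
kept <- toy %>% filter(sleep_total > 18)
kept$name
# "A"
```

Note that the row with exactly 18 hours is dropped, because > is a strict comparison; use >= to include the threshold itself.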

Use ! to reverse any condition. This is useful for finding everything except a specific group.

Code
# Mammals NOT sleeping more than 18 hours (i.e., 18 hours or fewer)
short_sleepers <- msleep %>%
  select(name, sleep_total) %>%
  # Parentheses make it explicit that ! negates the whole comparison
  filter(!(sleep_total > 18))
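In R, ! has lower precedence than the comparison operators, so !sleep_total > 18 is read as !(sleep_total > 18). The short base R sketch below, using an invented vector, checks that this negation matches the flipped comparison <= 18 when no values are missing.

```r
# Hypothetical values for illustration only
sleep_total <- c(19.5, 12.0, 18.0)

# Negating "greater than 18" ...
negated <- !(sleep_total > 18)

# ... gives the same answer as "less than or equal to 18"
flipped <- sleep_total <= 18

negated
# FALSE TRUE TRUE
identical(negated, flipped)
# TRUE
```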

Phase 2: Combining Multiple Conditions

Real-world analysis often requires meeting multiple criteria simultaneously.

Rows must meet all conditions. Inside filter(), separating conditions with a comma has the same effect as joining them with &.

Code
# Must be a Primate AND weigh more than 20kg
heavy_primates <- msleep %>% 
  select(name, order, bodywt, sleep_total) %>% 
  filter(order == "Primates", bodywt > 20)

Rows must meet at least one condition.

Code
# Is either a Primate OR weighs more than 20kg
primates_or_heavy <- msleep %>% 
  select(name, order, bodywt, sleep_total) %>% 
  filter(order == "Primates" | bodywt > 20)
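The difference between AND and OR is easiest to see by running both on the same rows. The following sketch assumes a made-up four-row tibble (toy, with invented orders and weights) and compares the comma form, the & form, and the | form.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not taken from msleep)
toy <- tibble(
  name   = c("A", "B", "C", "D"),
  order  = c("Primates", "Primates", "Carnivora", "Carnivora"),
  bodywt = c(62, 5, 160, 3)
)

# AND: a comma and & select the same rows
and_comma <- toy %>% filter(order == "Primates", bodywt > 20)
and_amp   <- toy %>% filter(order == "Primates" & bodywt > 20)

# OR: a row passes if either condition holds
or_rows <- toy %>% filter(order == "Primates" | bodywt > 20)

and_comma$name
# "A"
identical(and_comma$name, and_amp$name)
# TRUE
or_rows$name
# "A" "B" "C"
```

OR can never return fewer rows than AND for the same pair of conditions, which makes for a quick sanity check.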

Phase 3: Filtering Categorical Lists

When searching for multiple names or categories, typing | repeatedly is verbose and error-prone. Instead, use the %in% operator.

Code
# Searching for specific names one by one
manual_search <- msleep %>% 
  select(name, sleep_total) %>% 
  filter(name == "Rabbit" | name == "Tiger" | name == "Horse")

Use c() (“combine”, often read as “concatenate”) to build a vector of values, then match against all of them at once with %in%.

Code
# Search using a character vector of names
pro_search <- msleep %>% 
  select(name, sleep_total) %>% 
  filter(name %in% c("Rabbit", "Tiger", "Horse"))
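As a quick check that the two approaches agree, the base R sketch below compares the chained | version with %in% on an invented vector of names (the extra "Cow" entry is only there to show a non-match).

```r
# Hypothetical names for illustration only
name   <- c("Horse", "Rabbit", "Tiger", "Cow")
wanted <- c("Rabbit", "Tiger", "Horse")

# Chained comparisons, one per value
chained <- name == "Rabbit" | name == "Tiger" | name == "Horse"

# Set membership: is each name found anywhere in `wanted`?
member <- name %in% wanted

member
# TRUE TRUE TRUE FALSE
identical(chained, member)
# TRUE

# Negate with ! to keep everything except the listed values
name[!(name %in% wanted)]
# "Cow"
```

The order of values inside c() does not matter to %in%, since it only asks whether each element appears somewhere in the vector.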

Phase 4: Range and Proximity Filtering

Sometimes you don’t need an exact number, but rather a “neighborhood” of values.

between() includes the boundary values you provide.

Code
# Sleep total between 10 and 16 hours (inclusive)
mid_sleepers <- msleep %>% 
  select(name, sleep_total) %>%
  filter(between(sleep_total, 10, 16))
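between() is a shortcut for a pair of >= and <= comparisons. The sketch below, using invented values placed just inside and just outside the range, checks that equivalence.

```r
library(dplyr)

# Hypothetical values chosen to sit on and around the boundaries
sleep_total <- c(9.9, 10, 13, 16, 16.1)

in_range <- between(sleep_total, 10, 16)
by_hand  <- sleep_total >= 10 & sleep_total <= 16

in_range
# FALSE TRUE TRUE TRUE FALSE
identical(in_range, by_hand)
# TRUE
```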

Finds values close to a target within a specified tolerance.

Code
# Target 17 hours, with a 0.5 hour buffer (strictly between 16.5 and 17.5)
approx_sleepers <- msleep %>% 
  select(name, sleep_total) %>%
  filter(near(sleep_total, 17, tol = 0.5))
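One detail worth knowing: dplyr's near() tests abs(x - y) < tol, which is a strict comparison, so values exactly tol away from the target are not matched. The sketch below uses invented values to contrast this with between(), which does include its endpoints.

```r
library(dplyr)

# Hypothetical values, including the two edges of the buffer
sleep_total <- c(16.5, 16.75, 17.5, 18)

# near(): strictly less than 0.5 away from 17
close_to_17 <- near(sleep_total, 17, tol = 0.5)
close_to_17
# FALSE TRUE FALSE FALSE

# between(): the endpoints 16.5 and 17.5 are included
within_band <- between(sleep_total, 16.5, 17.5)
within_band
# TRUE TRUE TRUE FALSE
```

If the endpoints matter for your analysis, between() with explicit limits is the safer choice; near() is mainly intended for comparing floating-point numbers.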

Phase 5: Handling Missing Data (NA)

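Missing values are a unique “state” in R. You cannot test for them with == NA, because any comparison with NA returns NA rather than TRUE or FALSE, and filter() drops rows whose condition is NA; use the is.na() function instead.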

Find rows where information is missing.

Code
# Show mammals with missing conservation status
missing_info <- msleep %>% 
  select(name, conservation, sleep_total) %>%
  filter(is.na(conservation))
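To illustrate why == NA does not work, the sketch below uses a made-up three-row tibble (toy, with invented conservation codes) and compares the two approaches.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not taken from msleep)
toy <- tibble(
  name         = c("A", "B", "C"),
  conservation = c("lc", NA, "en")
)

# Comparing with NA never yields TRUE; every result is NA
eq_na <- toy$conservation == NA
eq_na
# NA NA NA

# filter() treats NA as "do not keep", so nothing survives
n_with_eq <- nrow(filter(toy, conservation == NA))
n_with_eq
# 0

# is.na() returns a proper TRUE/FALSE for each row
n_with_isna <- nrow(filter(toy, is.na(conservation)))
n_with_isna
# 1
```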

Find rows where information is complete.

Code
# Keep only rows where conservation status is known
complete_info <- msleep %>% 
  select(name, conservation, sleep_total) %>%
  filter(!is.na(conservation))

🎓 Systemic Summary for Learners

Operator/Function   Systemic Role    Meaning
==                  Equality         Must match exactly.
!=                  Inequality       Must NOT match.
, or &              Intersection     Both conditions must be TRUE.
|                   Union            Either condition can be TRUE.
%in%                Set Membership   Value must be in the provided list.
is.na()             Null Check       Identifies missing values.
between()           Range Check      Between x and y (inclusive).

Pro-Tip: Filtering is the first step in Data Cleaning. Always check the number of rows remaining with nrow() after filtering to ensure you haven’t accidentally filtered out every row of your dataset!
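A minimal sketch of that row-count check, assuming a made-up tibble (toy) and a deliberately impossible threshold so that the filter returns nothing:

```r
library(dplyr)

# Hypothetical toy data for illustration only (not taken from msleep)
toy <- tibble(
  name        = c("A", "B", "C"),
  sleep_total = c(19.5, 12.0, 18.0)
)

# A condition no row can meet (more than 24 hours of sleep per day)
too_strict <- toy %>% filter(sleep_total > 24)

rows_before <- nrow(toy)
rows_after  <- nrow(too_strict)

rows_before
# 3
rows_after
# 0

# Flag the problem instead of silently continuing with an empty table
if (rows_after == 0) {
  message("Filter removed every row; check the condition.")
}
```

The original data frame is untouched by filter(); only the new object is empty, so the fix is to correct the condition and run the filter again.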

Courses with short, easy-to-digest video content are available at premieranalytics.com.bd. Each lesson uses data that is built into R or comes with installed packages, so you can replicate the work at home. premieranalytics.com.bd also includes teaching on statistics and research methods.