In this explainer I will cover the basics, and some more advanced uses, of the facet of R that I find most useful: filtering.
In its most basic sense, filtering in R shows you the values in a dataset that match a set of criteria.
It is a simple and valuable way to answer basic questions about data, and it primarily uses the dplyr package.
For this explainer we will be using one of the basic R datasets, the Motor Trend Car Road Tests.
library(dplyr)
data("mtcars")
The first step in filtering is understanding the data you are working with.
Rather than opening the original CSV file or anything of the sort, looking at the head of the dataset tends to work just fine here.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
This particular dataset contains columns for miles per gallon, number of cylinders, displacement in cubic inches, gross horsepower, rear axle ratio, weight in thousands of pounds, quarter mile time, engine shape (0 = V-shaped, 1 = straight), transmission (0 = automatic, 1 = manual), number of forward gears, and number of carburetors.
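If you want a bit more than the first six rows, a few base R functions give a quick overview of the shape and column types. This is only a minimal sketch of how I might inspect the data before filtering:

```r
# Load the built-in dataset and inspect it before filtering.
data("mtcars")

dim(mtcars)           # number of rows and columns
str(mtcars)           # column names, types, and the first few values
summary(mtcars$qsec)  # range of quarter mile times, handy for picking thresholds
```

Knowing the range of a column up front makes it easier to pick filter values that actually return rows.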
The next step is to select the columns that you specifically want to filter. You can do this in one fell swoop to get results, but I usually like to select columns as a new dataframe to make things a bit easier to follow and a bit cleaner.
We will name this new dataframe “mtcars_simple,” and we will look at the miles per gallon, the cylinders, the gross horsepower, the quarter mile time, and the transmission type. This will be the data frame we primarily work with.
mtcars_simple <- mtcars %>%
  select(mpg, cyl, hp, qsec, am)
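For comparison, here is a rough sketch of the "one fell swoop" version, where the selection and a filter happen in a single pipeline. The name mtcars_one_step and the cutoff of 17 seconds are my own choices for illustration.

```r
library(dplyr)
data("mtcars")

# Select the columns and filter in a single pipeline,
# without creating an intermediate data frame.
# `mtcars_one_step` is a hypothetical name used only for this sketch.
mtcars_one_step <- mtcars %>%
  select(mpg, cyl, hp, qsec, am) %>%
  filter(qsec <= 17)

head(mtcars_one_step)
```

Both approaches give the same rows; keeping the intermediate data frame just makes each step easier to check.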
Filtering for a value is the simplest way to find out information from data. Depending on which comparison operator you use in the code, you can see values that are greater than, less than, greater than or equal to, less than or equal to, or equal to another value.
The basic structure of the filtering command can be described as “filter(dataset, condition).”
For this example, we will be creating another dataframe called “mtcars_fast” showing cars that have a quarter mile time of less than or equal to 17.
mtcars_fast <- mtcars_simple %>%
  filter(qsec <= 17)
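To make the list of comparison operators concrete, here is a small sketch that uses each of them in the “filter(dataset, condition)” form. The data frame names and cutoff values are made up for illustration.

```r
library(dplyr)
data("mtcars")

mtcars_simple <- mtcars %>%
  select(mpg, cyl, hp, qsec, am)

# Each call keeps only the rows where the condition is TRUE.
# All names and thresholds below are illustrative choices.
six_cyl   <- filter(mtcars_simple, cyl == 6)   # equal to
not_six   <- filter(mtcars_simple, cyl != 6)   # not equal to
strong    <- filter(mtcars_simple, hp > 200)   # greater than
thirsty   <- filter(mtcars_simple, mpg < 15)   # less than
efficient <- filter(mtcars_simple, mpg >= 25)  # greater than or equal to

nrow(six_cyl)
```

Note that equality uses a double equals sign; a single “=” inside filter() is treated as an argument name and produces an error.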
Filtering for characters rather than numbers is fairly simple. The basic rule to remember is that a character value always goes between quotation marks, while a column name that contains a space goes between backticks.
Unfortunately, this dataset is entirely numeric (the car names are stored as row names rather than as a column), so there is no character column to filter here. If the data had a character column called “name,” filtering for the exact value “Mazda” would look something like this. Keep in mind that “==” only matches when the whole string is equal.
# mtcars_names <- mtcars_simple %>%
# filter(name == "Mazda")
You could also use a negative logical condition to look for all the names that are not that one, but that will be covered later.
A little fun fact as well: if you use the “>” or “<” operators in this context with a letter as the value, the filter returns the values that sort alphabetically after or before that letter.
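To try character filtering on this data anyway, one option is to copy the row names into a regular column first. The sketch below assumes that approach; the column name `name` and the data frame names are my own, and the alphabetical comparison can depend on your locale.

```r
library(dplyr)
data("mtcars")

# Assumption: copy the row names into a regular column called `name`
# so that there is a character column to filter on.
mtcars_named <- mtcars
mtcars_named$name <- rownames(mtcars)

# Exact match: the whole string has to be equal.
mazda_rx4 <- mtcars_named %>%
  filter(name == "Mazda RX4")

# Partial match: keep every name that starts with "Mazda".
mazdas <- mtcars_named %>%
  filter(startsWith(name, "Mazda"))

# Alphabetical comparison: names that sort before "D".
# The exact ordering can depend on your locale.
early_names <- mtcars_named %>%
  filter(name < "D")

mazdas$name
```

The exact match returns a single car, while the partial match picks up both Mazda models in the data.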
You can also filter for multiple values or conditions to get answers about the data.
Filtering for multiple values in the same column only requires the use of the operator “%in%”. Here we will filter for vehicles that have 4 or 8 cylinders and call the new dataframe “mtcars_cylinders.”
mtcars_cylinders <- mtcars_simple %>%
  filter(cyl %in% c(4, 8))
To filter with multiple conditions you use the “and” and “or” operators, which are “,” (or “&”) and “|” respectively. Using these operators, you can chain together multiple conditions to return a very specific result.
You can even use this to filter across a few specific columns. The new dataframe for this operation will be called “mtcars_hp_cylinder,” and will identify cars with 6 or 8 cylinders that have quarter mile times less than or equal to 16 seconds.
mtcars_hp_cylinder <- mtcars_simple %>%
  filter(qsec <= 16, (cyl == 6 | cyl == 8))
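One thing to watch for when combining conditions with “|”: each side has to be a complete comparison. The sketch below shows two equivalent ways to write the 6-or-8-cylinder filter, plus a check of what happens when the second comparison is left out. The data frame names are made up for illustration.

```r
library(dplyr)
data("mtcars")

mtcars_simple <- mtcars %>%
  select(mpg, cyl, hp, qsec, am)

# Spell out both comparisons on either side of `|` ...
six_or_eight <- mtcars_simple %>%
  filter(qsec <= 16, cyl == 6 | cyl == 8)

# ... or use %in%, which keeps the same rows.
six_or_eight_in <- mtcars_simple %>%
  filter(qsec <= 16, cyl %in% c(6, 8))

# Pitfall: `cyl == 6 | 8` is TRUE for every row, because a non-zero
# number on its own counts as TRUE, so it filters nothing out.
all(mtcars_simple$cyl == 6 | 8)
```

For more than two values in the same column, “%in%” is usually the easier form to read.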
You can also wrap two conditions in the “xor” function, as in “xor(condition1, condition2),” to filter for rows where exactly one of the conditions is true, not both.
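As a rough sketch of xor in practice, the example below keeps cars that are either efficient or powerful, but not both. The name mtcars_either and the two thresholds are assumptions chosen only for illustration.

```r
library(dplyr)
data("mtcars")

mtcars_simple <- mtcars %>%
  select(mpg, cyl, hp, qsec, am)

# Keep cars with mpg above 20 or hp above 100, but drop the cars
# that satisfy both conditions at once.
mtcars_either <- mtcars_simple %>%
  filter(xor(mpg > 20, hp > 100))

nrow(mtcars_either)
```

Swapping xor() for “|” would bring back the cars that meet both conditions.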
In certain datasets, especially those with similar columns, it may be useful to filter the entire dataset by a certain condition.
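This can be accomplished with any of these three functions: “filter_all,” “filter_if,” and “filter_at.” In current versions of dplyr these are marked as superseded in favor of “if_any” and “if_all” used inside “filter,” but they still work.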
“filter_all” does exactly what it sounds like. It filters all the columns through a condition. For the position normally reserved for the column name in the command, we can instead use “any_vars” as an “or” condition and “all_vars” as an “and” condition.
In this example, we will filter for the rows in which every value is greater than or equal to 1. The dataframe will be called “mtcars_1.”
mtcars_1 <- mtcars_simple %>%
  filter_all(all_vars(. >= 1))
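To see the difference between the “and” and “or” versions, here is a small sketch that runs the same condition through all_vars and any_vars. Since transmission is coded as 0 or 1 and every other selected column is well above 1, the all_vars version ends up keeping only the manual cars. The last call assumes a reasonably recent dplyr (1.0.4 or later) that provides if_all; the data frame names are my own.

```r
library(dplyr)
data("mtcars")

mtcars_simple <- mtcars %>%
  select(mpg, cyl, hp, qsec, am)

# all_vars(): every column must pass (an "and" across columns).
every_col_ge_1 <- mtcars_simple %>%
  filter_all(all_vars(. >= 1))

# any_vars(): at least one column must pass (an "or" across columns).
some_col_ge_1 <- mtcars_simple %>%
  filter_all(any_vars(. >= 1))

# Assumption: dplyr >= 1.0.4, where if_all() is available as the
# newer way to write the all_vars() filter.
every_col_ge_1_new <- mtcars_simple %>%
  filter(if_all(everything(), ~ .x >= 1))

c(nrow(every_col_ge_1), nrow(some_col_ge_1))
```

Because “am” is the only column that ever drops below 1, the “and” version is effectively a filter on transmission type here.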
“filter_if” applies the condition only to the columns that pass a test, while still displaying the other columns in the result. The tests you can use include “is.numeric,” “is.integer,” “is.double,” “is.logical,” and “is.factor.” This dataset is not a good example for this operation, since every column is numeric, but it would look something like this if I were to filter only numeric columns for a value.
# mtcars_numeric <- mtcars_simple %>%
#   filter_if(is.numeric, all_vars(. == 1))
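To see filter_if do something visible, the data needs at least one non-numeric column. The sketch below assumes we add the car names as a character column first; the names mtcars_mixed and mtcars_numeric_only, and the condition itself, are only for illustration.

```r
library(dplyr)
data("mtcars")

# Assumption: add a character column so the data has mixed column types.
mtcars_mixed <- mtcars %>%
  select(mpg, cyl, hp, qsec, am)
mtcars_mixed$name <- rownames(mtcars)

# The condition is applied only to the numeric columns. The `name`
# column is skipped by the test but still appears in the result.
mtcars_numeric_only <- mtcars_mixed %>%
  filter_if(is.numeric, all_vars(. >= 1))

head(mtcars_numeric_only)
```

Without the is.numeric test, the same condition would also be compared against the character column, which is rarely what you want.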
“filter_at” is the final, and arguably most useful, command for filtering across columns.
It allows you to perform filters within a selected set of variables denoted by the “vars” function. That function can wrap any sort of column identifier, so it is very versatile.
You can then filter with the same “any_vars” and “all_vars” functions mentioned above.
In the example below, the new dataframe will keep rows that have mpg numbers over 20 or quarter mile times greater than 20. It will be called “mtcars_mileage_slow.”
mtcars_mileage_slow <- mtcars_simple %>%
  filter_at(vars(mpg, qsec), any_vars(. > 20))
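As a quick comparison, the sketch below runs the same two columns through any_vars and all_vars, and then writes the “or” version out by hand to show what filter_at is doing. The data frame names are hypothetical.

```r
library(dplyr)
data("mtcars")

mtcars_simple <- mtcars %>%
  select(mpg, cyl, hp, qsec, am)

# "or": mpg over 20, or quarter mile time over 20.
mileage_or_slow <- mtcars_simple %>%
  filter_at(vars(mpg, qsec), any_vars(. > 20))

# "and": both mpg and quarter mile time over 20.
mileage_and_slow <- mtcars_simple %>%
  filter_at(vars(mpg, qsec), all_vars(. > 20))

# The "or" filter written out by hand keeps the same rows.
by_hand <- mtcars_simple %>%
  filter(mpg > 20 | qsec > 20)

c(nrow(mileage_or_slow), nrow(mileage_and_slow))
```

With only two columns the hand-written version is just as short, but filter_at saves a lot of typing once the list of columns grows.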
All of these commands can have their logic reversed by simply adding the “!” operator inside the filter function. This works with all of the previous commands, but can be best demonstrated with a simple filter of the quarter mile times.
The new dataframe will be called “mtcars_fastest.”
mtcars_fastest <- mtcars_simple %>%
  filter(!(qsec > 14))
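Here is a short sketch of the same idea applied to two other conditions, including the negated “%in%” filter. The data frame names and the 16-second cutoff are my own choices for illustration.

```r
library(dplyr)
data("mtcars")

mtcars_simple <- mtcars %>%
  select(mpg, cyl, hp, qsec, am)

# `!` flips TRUE and FALSE, so this keeps cars whose quarter mile
# time is NOT above 16 seconds (the same as qsec <= 16).
not_slow <- mtcars_simple %>%
  filter(!(qsec > 16))

# Negating %in% keeps everything except the listed values.
not_4_or_8 <- mtcars_simple %>%
  filter(!(cyl %in% c(4, 8)))

unique(not_4_or_8$cyl)
```

Wrapping the condition in parentheses is optional in these cases, but it makes it obvious what the “!” applies to.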
Combining a test function, much like the ones used with “filter_if,” with the “!” from reversing the logic, you can quickly and easily remove any NA values in your data.
This can make data easier to work with and can provide better results for all the aforementioned filters.
Unfortunately this data contains no NA values, but you would accomplish this by using the “is.na” function like so.
# mtcars_clean <- mtcars_simple %>%
# filter (!is.na(hp))
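Since mtcars has nothing missing, the sketch below uses a tiny made-up data frame to show the NA filter actually removing rows. Every name and value in it is invented purely for illustration.

```r
library(dplyr)

# A small made-up data frame with missing values, for illustration only.
cars_with_na <- data.frame(
  name = c("car_a", "car_b", "car_c", "car_d"),
  hp   = c(110, NA, 93, 175),
  qsec = c(16.5, 17.0, NA, 15.5)
)

# Drop rows where hp is missing.
hp_known <- cars_with_na %>%
  filter(!is.na(hp))

# Drop rows with a missing value in either column.
complete_rows <- cars_with_na %>%
  filter(!is.na(hp), !is.na(qsec))

# filter() also drops rows where the condition itself evaluates to NA,
# so car_c disappears here even without an explicit is.na() check.
fast <- cars_with_na %>%
  filter(qsec <= 16)

nrow(complete_rows)
```

That last behavior is worth remembering: rows with a missing value in the filtered column are silently left out of the result.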
Filtering is a simple and extremely useful function when working with any data in R.
It can go a long way toward cleaning up data, selecting specific data, and answering questions both basic and complex about the data.
While it can be a bit confusing at times, I hope this explainer broke it down to the basics so you can filter out exactly what you need from your data in the future.