What Is dplyr?

I get it, I’m just like you. I never studied any form of coding, and most of the time I just use Excel, PowerPoint, or Word when my boss needs me to present information visually. However, most of the time I have to take information from different sources and reshape it to fit the task my boss has given me. It is painstaking to build my own visual and enter all of the data I want or need by hand every single time (and then constantly update it by hand). I never changed my ways because I felt that coding was for programmers and computer engineers and that I, as a Project Manager, could not use it for my own work.

But, if we use dplyr, we can do this entire process much faster and we can create visuals that are personalized to how we want to display our data.

dplyr is a package in R that allows you to easily manipulate tabular data. The package contains a set of functions (or “verbs”) that allow you to do things such as filter rows, select specific columns, re-order rows, add new columns with new calculations, and summarize data, to name a few.

One of the major benefits that I can see from dplyr is that it allows you to write code in shorter form that follows a step-by-step process. This not only makes it easier to write, but it also makes it easier to read and understand what the code will do when you run it.

Bottom Line: dplyr is easy to use and a great tool for anyone (student or professional) looking to customize and display tabular data.

Now, we are going to keep this Code-Through very simple and not get too fancy with how we make our tables look. If you want to make your tables look even better after learning dplyr, I recommend learning how to incorporate the kable() function into your code chunks. You can find more information about that here.


How to Install and Load dplyr

The first thing we want to do is install the dplyr package into R with the function install.packages() and load it with the function library():
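Here is a minimal sketch of that step. The requireNamespace() check is my own addition (an assumption, not part of the original walkthrough); it simply skips the download if dplyr is already installed.

```r
# Install dplyr only if it is not already on this machine.
# (The requireNamespace() guard is an optional convenience.)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load dplyr; this has to be done in every new R session.
library(dplyr)
```

You only need to install a package once, but you need to call library() each time you start R.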


Which Dataset are We Going to Be Using?

To demonstrate the concepts of dplyr and how we can easily transform a dataset into something useful for ourselves and our business practices, we are going to be using the dataset called USArrests. This is a great dataset to use for learning because it consists of 50 observations and only 4 variables.

A dataset is exactly what it sounds like: a set of data compiled into one source for easy access. Think of it like information compiled on an Excel sheet. Specifically, the USArrests dataset contains statistics from 1973 that cover arrests per 100,000 residents in all 50 states. The arrests are broken down into murder, assault, and rape. Additionally, there is information about the percentage of the population that lives in urban areas. These make up our 4 variables. Here is that same info broken out:

  • Murder = numeric Murder arrests (per 100,000)
  • Assault = numeric Assault arrests (per 100,000)
  • UrbanPop = numeric percent of the Urban Population
  • Rape = numeric Rape arrests (per 100,000)

Learn More: If you want to see statistical docs on this subject from 1975, you can find that information here.


How to Load the USArrests Dataset

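We are going to build off of the previous code chunk and include the function data(). USArrests ships with R, so there is nothing extra to install.

It is a good habit to include the quotation marks around “USArrests”, although data(USArrests) without them also works.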
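A minimal sketch of loading the dataset and taking a first look at its size and column names:

```r
library(dplyr)

# Load the built-in USArrests dataset into the workspace
data("USArrests")

# Quick checks: number of rows and columns, and the column names
dim(USArrests)
names(USArrests)
```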


Important dplyr Verbs (Functions)

The following are the verbs (as well as one operator) that we are going to use in this Code-Through:

Verbs:

  • select() Allows you to select specific columns
  • filter() Allows you to filter specific rows
  • arrange() Allows you to reorder/arrange the rows
  • mutate() Allows you to create new columns based on new equations
  • summarise() Allows you to summarise values
  • group_by() Allows you to split up data, apply an action to it, and then combine it

Operator:

  • %>% This is the Pipe Operator. It allows you to turn your code into a more step-by-step layout. I love it because it helps you clean up your code so you don’t have to keep typing the same things over and over again and it makes your code easier to read.


What Does Our Data Look Like Before We Use dplyr?

This table is what the raw data looks like. As you can see, it is in tabular form and simply lists everything in alphabetical order by State. 50 States means that we have 50 observations, and 5 columns means that we have 5 variables (the 4 described above plus the States). It is a simple dataset, but that will make it much easier to understand the verbs (or functions) of dplyr.


Using dplyr Verbs

Now that we know what our data looks like, let’s start changing it up!


First, Learning Pipes - %>%

Don’t let this operator overwhelm you. We are only introducing it at the start because we are going to be using it throughout this Code-Through. Pipes help clean up your code and make it easier to read. Additionally, they make it so that you don’t have to keep writing the same pieces of your code over and over again.

Let’s take a look:
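The two examples might look something like the sketch below. The exact calculations are my assumption; the point is only the contrast between naming the dataset on every line and naming it once. The State column is added first with mutate(), which is covered in its own section.

```r
library(dplyr)
data("USArrests")

# Assumption: give the state names their own column called State
USArrests <- USArrests %>% mutate(State = state.name)

# Example 1: no pipe, so USArrests$ has to be typed every time
mean(USArrests$Assault)
min(USArrests$Assault)
max(USArrests$Assault)

# Example 2: with the pipe, USArrests is typed once and each
# line hands its result to the next line
piped <- USArrests %>%
  group_by(State) %>%
  summarise(ave_Assault = mean(Assault)) %>%
  head()
piped
```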

Notice how, in the first example, each time I want to use “Assault”, I have to call the dataset. In this case, that means USArrests followed by the $ and then Assault. While this example is very simplistic, after a while the code chunk will get longer and longer, and eventually it becomes difficult to read and to understand what is going on.

In the second example, notice how I only had to call USArrests one time in the entire chunk. This is because the %>% operator works like a pipe where each line of code feeds into the next line. For this example, the dataset USArrests goes into the group_by() function, which then goes into the summarise() function, which then goes into the head() function. Each one is a step in the process.

You will see in the upcoming sections just how useful the %>% (pipe) operator is, especially when you combine lots of functions.


Selecting Columns - select()

This first verb (function) is foundational to dplyr; it is the bread and butter, if you will. The select() verb allows you to select certain columns to either include or not include. It is as simple as that. Let’s look at some examples:
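A minimal sketch of both uses, assuming the State column has been added as described in the mutate() section. The particular columns chosen here are only for illustration.

```r
library(dplyr)
data("USArrests")

# Assumption: give the state names their own column called State
USArrests <- USArrests %>% mutate(State = state.name)

# Include: keep only the columns that are named
kept <- USArrests %>%
  select(State, Murder, Assault) %>%
  head()
kept

# Exclude: a minus sign drops a column and keeps the rest
dropped <- USArrests %>%
  select(-Rape) %>%
  head()
dropped
```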

When you put it together, select() is very easy to use. You can include or exclude as many columns as you want.

NOTE: You will notice that I used the function head() at the end of my code. That keeps the output short by showing only the first 6 rows. If you want to show more, simply put a number inside the () like this:
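For instance, a sketch that shows the first 10 rows instead of 6 (the selected columns are an arbitrary choice):

```r
library(dplyr)
data("USArrests")

# Assumption: give the state names their own column called State
USArrests <- USArrests %>% mutate(State = state.name)

# head(10) keeps the first 10 rows instead of the default 6
first_ten <- USArrests %>%
  select(State, Murder) %>%
  head(10)
first_ten
```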


Filtering Rows - filter()

Now, what would we do if we wanted to only show certain rows in our dataset? That is where the filter() verb comes in handy:
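Two sketches of filter(), one that matches a single value and one that keeps rows meeting a condition. The state and the cutoff of 80 are values I picked for illustration.

```r
library(dplyr)
data("USArrests")

# Assumption: give the state names their own column called State
USArrests <- USArrests %>% mutate(State = state.name)

# Example 1: keep only the row for one state
one_state <- USArrests %>%
  select(State, Murder, UrbanPop) %>%
  filter(State == "Arizona")
one_state

# Example 2: keep only the states where more than 80 percent
# of the population lives in urban areas
very_urban <- USArrests %>%
  select(State, Murder, UrbanPop) %>%
  filter(UrbanPop > 80)
very_urban
```

Note that == (two equals signs) tests whether values match, while a single = is used for naming things.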

As you can see from the two examples, you can use filter() to only show certain rows based on one variable output (first example) or by outputs that meet certain criteria (second example). This is a super helpful tool to have when customizing what data to show.

Additionally, if you noticed, we are starting to combine our verbs. In each of the two examples above, we also included select(), and the code chunk ran perfectly. This works because each dplyr verb takes a table and returns a table, so we can keep piping one step into the next with %>%.


Arranging Rows - arrange()

What would we do if we wanted to re-order our rows? Easy, we just use the arrange() verb:
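A minimal sketch that sorts the rows by UrbanPop from lowest to highest, assuming the State column from the mutate() section:

```r
library(dplyr)
data("USArrests")

# Assumption: give the state names their own column called State
USArrests <- USArrests %>% mutate(State = state.name)

# arrange() sorts from lowest to highest by default
least_urban <- USArrests %>%
  arrange(UrbanPop) %>%
  head()
least_urban
```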

Notice how the table changed. Now, everything is no longer listed in alphabetical order by State, but starts with the lowest Urban Population. Also, don’t forget that by including head() at the end of my code chunk, it only shows the first 6 rows instead of the entire dataset.

What if I want to show the same data, but in descending order?
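A sketch of the descending version, wrapping the column in desc() and adding select() to trim the table:

```r
library(dplyr)
data("USArrests")

# Assumption: give the state names their own column called State
USArrests <- USArrests %>% mutate(State = state.name)

# desc() flips the sort so the highest value comes first
most_urban <- USArrests %>%
  select(State, UrbanPop) %>%
  arrange(desc(UrbanPop)) %>%
  head()
most_urban
```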

Now, everything in the table is arranged from the highest Urban Population to the lowest because we added desc() inside the arrange() function. And again, you can see that these verbs can be combined into the same code chunk (we included select() to give the table a different look).


Creating Columns - mutate()

One thing you may not have noticed yet is that I have already used the mutate() verb. The only reason you don’t see it is that I did it behind the scenes. If you were to go back to our dataset USArrests and run it in your own console, you would notice that the column of all of the States does not have a title (in R terms, the state names are stored as row names rather than as a column). So, I used mutate() to create a new column. Let me show you how I did that along with how to make even more columns:
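A sketch of that behind-the-scenes step, assuming it was done with the built-in state.name vector as described just below:

```r
library(dplyr)
data("USArrests")

# state.name is a built-in vector holding the 50 state names.
# mutate() adds it as a new column called State, and <- saves
# the result back over USArrests so the column sticks around.
USArrests <- USArrests %>%
  mutate(State = state.name)

head(USArrests)
```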

As you can see here, I did two things. First, I used mutate() to title the column of all of the state.name observations to be State. Then, I assigned the entire chunk to the USArrests dataset using <- to make sure that the dataset always displayed this new column.

Let’s use this same concept, but without assigning anything new to our dataset:
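For example, here is a sketch that adds a combined arrest rate without saving it back to the dataset. The column name Total_Arrests and the formula are my own illustration.

```r
library(dplyr)
data("USArrests")

# Assumption: give the state names their own column called State
USArrests <- USArrests %>% mutate(State = state.name)

# New column name on the left of =, the calculation on the right.
# Nothing is assigned with <-, so USArrests itself is unchanged.
USArrests %>%
  mutate(Total_Arrests = Murder + Assault + Rape) %>%
  head()
```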

Take note of how each time a new column is added, it is added to the right, just like in Excel. Also, notice what is inside of the () of mutate(). First, we put the new title of the column, an =, and then an equation. Again, it is important to note that we can continue to combine these dplyr verbs together in one code chunk, like this:

We can do so much with mutate(). The possibilities are nearly endless. Play around with it and see what columns you can create. You will find that you will be able to pretty much create any column with any metric that your boss or professor is looking for based off the dataset.
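A sketch that chains select(), mutate(), and arrange() in one chunk. Total_Arrests is a hypothetical column name used only for this illustration.

```r
library(dplyr)
data("USArrests")

# Assumption: give the state names their own column called State
USArrests <- USArrests %>% mutate(State = state.name)

combined <- USArrests %>%
  select(State, Murder, Assault, Rape) %>%           # keep four columns
  mutate(Total_Arrests = Murder + Assault + Rape) %>% # add a new one
  arrange(desc(Total_Arrests)) %>%                    # highest total first
  head()
combined
```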


Summarising Values - summarise()

The summarise() verb is useful if you want to create a summary of statistics of a specific column. It kind of works like mutate() where you can create new things with equations. Let’s take a look at an example:
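A minimal sketch that boils the UrbanPop column down to a single average:

```r
library(dplyr)
data("USArrests")

# summarise() collapses all 50 rows into one row of results
one_stat <- USArrests %>%
  summarise(ave_UrbanPop = mean(UrbanPop))
one_stat
```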

We can also create multiple outputs with summarise(), like this:
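A sketch with several summaries at once, separated by commas. The output names follow the ones mentioned in the group_by() section; any other statistics you add would work the same way.

```r
library(dplyr)
data("USArrests")

# Each name = calculation pair becomes one column in the result
many_stats <- USArrests %>%
  summarise(
    ave_UrbanPop = mean(UrbanPop),
    min_UrbanPop = min(UrbanPop),
    max_UrbanPop = max(UrbanPop)
  )
many_stats
```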

In the above code chunk example, we created a whole slew of summary outputs.

Note that summarise() can also be written as summarize(). They will both produce the same results and you can even use both throughout your code. However, for consistency, it is best to pick one spelling and stick with it the whole time.

There are a bunch of things that you can create with summarise(). To see more, I recommend looking at the dplyr Cheat Sheet - GitHub, which is also linked at the bottom of the page.


Combining Data - group_by()

Lastly, we need to take a look at the group_by() verb. This one is very important to dplyr because it (1) splits up your data, (2) applies actions to it, and then (3) combines the data back together again with the new output. You will often see this pattern called “split-apply-combine”.

We can look at this by taking the same code chunk we used above when looking at summarise():
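A sketch of the same summarise() call with group_by(State) placed in front of it, assuming the State column from the mutate() section:

```r
library(dplyr)
data("USArrests")

# Assumption: give the state names their own column called State
USArrests <- USArrests %>% mutate(State = state.name)

# Split by State, summarise each group, combine into one table
by_state <- USArrests %>%
  group_by(State) %>%
  summarise(
    ave_UrbanPop = mean(UrbanPop),
    min_UrbanPop = min(UrbanPop),
    max_UrbanPop = max(UrbanPop)
  )
head(by_state)
```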

Before, when we were just working with summarise(), we did not have the States displayed in our table. Now, by grouping the data by States, we are able to see the output for all of our values that we created in summarise() next to each State.

Now, because USArrests has exactly one row per State, each group contains a single value, so you will notice that the outputs for ave_UrbanPop, min_UrbanPop, and max_UrbanPop are all the same. However, by using this verb and the concepts of dplyr, you will be able to take more complex datasets and create some really incredible, customized tables.


Putting It All Together

Now that we have covered all of the main dplyr verbs, let’s create a table where we use all of them together and see what that produces:
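Here is one possible sketch that uses every verb from this Code-Through. The Urban_Level column, the cutoffs of 100 and 65, and the output names are all hypothetical choices of mine, made so that group_by() has groups with more than one state in them.

```r
library(dplyr)
data("USArrests")

# Assumption: give the state names their own column called State
USArrests <- USArrests %>% mutate(State = state.name)

all_verbs <- USArrests %>%
  select(State, Murder, Assault, UrbanPop) %>%          # pick columns
  filter(Assault > 100) %>%                              # keep some rows
  mutate(Urban_Level = if_else(UrbanPop >= 65,
                               "High", "Low")) %>%      # label each state
  group_by(Urban_Level) %>%                              # split by label
  summarise(
    states      = n(),                                   # rows per group
    ave_Murder  = mean(Murder),
    ave_Assault = mean(Assault)
  ) %>%
  arrange(desc(ave_Murder))                              # sort the result
all_verbs
```

Reading from top to bottom, each line is one step: pick columns, keep rows, add a label, group, summarise, and sort.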

This is just one example. Play around with dplyr and see what you can cook up!


Discussion

dplyr is an extremely useful tool to have in your back pocket if you are a student or a business professional and you need to manipulate and create your own table off of some dataset that you have been given. It really doesn’t take much time, effort, or even knowledge about coding to be able to implement all of its functions.

On top of that, at least from my perspective, dplyr allows you to make your code really easy to read and understand. Even if you knew nothing about coding and you were handed a couple lines of code that were using dplyr, you most likely would be able to understand what the coder was trying to do with the data.

I am still in the early phases of learning how to code in R, but after learning about the dplyr package, I feel like I have a better grip on how to (1) write code, (2) read code, and (3) manipulate tabular data to display outputs.