Often times large datasets are divided into smaller dataframes and
stored in separate files. For instance, point of sale data may be stored
in such a way that there’s a separate file for each month.
Alternatively, subsets of the data are extracted from a large database
in smaller sections. In either of these situations, the rows from the
different dataframes need to be stacked together to form a single
dataframe. You might call this a vertical stack because it makes the
dataframe longer.
There are also times when you want to stack columns from one
dataframe to another. You might call this a horizontal stack because it
makes the dataframe wider.
After stacking dataframes together into a single dataframe, a common
task is to sort the data in a meaningful way.
This lesson focuses on three functions from the dplyr package that
helps accomplish these tasks. The first two functions, bind_rows and
bind_cols, are used for stacking data together.
The third function, arrange, is used for sorting data.
Preliminaries
Load the dplyr, and magrittr packages.
library(dplyr)
library(magrittr)
Make sure that this file and the jan17Items.csv file are in the same
folder and that the working directory is set to that folder.
Read in the jan17Items, feb17Items, and mar17Items.csv data.
j17i <- read.csv('jan17Items.csv')
f17i <- read.csv('feb17Items.csv')
m17i <- read.csv('mar17Items.csv')
Stacking Rows of Data Using dplyr’s bind_rows Function
Now that we’ve read in the point-of-sale data for three separate
months, let’s stack the rows together using the bind_rows dplyr
function. Let’s review a couple of key points from the help
documentation for that function by running ?bind_rows in
the Console.
Notice that the first argument is ellipses, indicating that you can
combine two or more dataframes, lists, or a list of dataframes. It’s
nice that we’re not limited to combining two dataframes at a time. Let’s
try it out:
allItems <- bind_rows(j17i, f17i, m17i)
We can visually inspect the dataframe and confirm that it has
observations from each month, and in the order that we entered them into
the function. We can also add up the rows for the three separate
dataframes and verify that the sum is equal to the rows of the new
allItems dataframe that we created.
A great aspect about this function is that you can have a different
number of columns, and the column names don’t have to match up. While
this can be a very useful apsect about this function it also means that
you have to be very careful because if the column names aren’t spelled
exactly the same way then you may unintentionally create new
columns.
Stacking Columns of Data Using dplyr’s bind_cols Function
The bind_cols function works similarly to the bind_rows function. You
may have noticed that the documentation for the bind_cols function is
the same as for the bind_rows function.
To illustrate the bind_cols function, let’s first separate the
allItems dataframe into two three separate dataframes of seven columns
each.
df1 <- allItems[,1:7]
df2 <- allItems[,8:14]
df3 <- allItems[,15:21]
Now, let’s bind them back together, but in a different order.
allItems2 <- bind_cols(df1, df3, df2)
Comparing the number of rows and columns in the allItems2 dataframe
to the allItems dataframe confirms that we have a dataframe that is the
same shape. A visual inspection confirms that the order of the columns
is different.
What happens if you use the bind_cols to stack columns that have a
different number of rows? It will throw an error unless there’s only one
row in which case it will fill in every other row with the values from
that one row.
It’s important to note that there are easier ways to reorder the
columns, without having to separate the columns and then piece them back
together again. ## Sorting the Data Using dplyr’s arrange Function A
related preprocessing task is ordering the data based on values in a
specific column. This is often desirable after joining two or more
dataframes together. Let’s assume that once we create the allItems
dataframe we want to sort the data by cardholder name, and then by date.
Here’s how we can easily accomplish that using the arrange function:
allItems <- allItems %>%
arrange(CardholderName, Time)
A visual inspection of the data confirms that the data is sorted
first by CardholderName and then by Time. If you want to sort in
descending order, then you can use the desc function to
indicate that’s what you want to do.
One other useful function is the two-way pipe operator,
%<>%, which saves you from retyping a dataframe’s
name if you’re going to reassign it to the same name.
allItems %<>%
arrange(desc(CardholderName), Time)
LS0tDQp0aXRsZTogIlN0YWNraW5nIGFuZCBTb3J0aW5nIERhdGEiDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCi0tLQ0KT2Z0ZW4gdGltZXMgbGFyZ2UgZGF0YXNldHMgYXJlIGRpdmlkZWQgaW50byBzbWFsbGVyIGRhdGFmcmFtZXMgYW5kIHN0b3JlZCBpbiBzZXBhcmF0ZSBmaWxlcy4gRm9yIGluc3RhbmNlLCBwb2ludCBvZiBzYWxlIGRhdGEgbWF5IGJlIHN0b3JlZCBpbiBzdWNoIGEgd2F5IHRoYXQgdGhlcmUncyBhIHNlcGFyYXRlIGZpbGUgZm9yIGVhY2ggbW9udGguIEFsdGVybmF0aXZlbHksIHN1YnNldHMgb2YgdGhlIGRhdGEgYXJlIGV4dHJhY3RlZCBmcm9tIGEgbGFyZ2UgZGF0YWJhc2UgaW4gc21hbGxlciBzZWN0aW9ucy4gSW4gZWl0aGVyIG9mIHRoZXNlIHNpdHVhdGlvbnMsIHRoZSByb3dzIGZyb20gdGhlIGRpZmZlcmVudCBkYXRhZnJhbWVzIG5lZWQgdG8gYmUgc3RhY2tlZCB0b2dldGhlciB0byBmb3JtIGEgc2luZ2xlIGRhdGFmcmFtZS4gWW91IG1pZ2h0IGNhbGwgdGhpcyBhIHZlcnRpY2FsIHN0YWNrIGJlY2F1c2UgaXQgbWFrZXMgdGhlIGRhdGFmcmFtZSBsb25nZXIuDQoNClRoZXJlIGFyZSBhbHNvIHRpbWVzIHdoZW4geW91IHdhbnQgdG8gc3RhY2sgY29sdW1ucyBmcm9tIG9uZSBkYXRhZnJhbWUgdG8gYW5vdGhlci4gWW91IG1pZ2h0IGNhbGwgdGhpcyBhIGhvcml6b250YWwgc3RhY2sgYmVjYXVzZSBpdCBtYWtlcyB0aGUgZGF0YWZyYW1lIHdpZGVyLg0KDQpBZnRlciBzdGFja2luZyBkYXRhZnJhbWVzIHRvZ2V0aGVyIGludG8gYSBzaW5nbGUgZGF0YWZyYW1lLCBhIGNvbW1vbiB0YXNrIGlzIHRvIHNvcnQgdGhlIGRhdGEgaW4gYSBtZWFuaW5nZnVsIHdheS4NCg0KVGhpcyBsZXNzb24gZm9jdXNlcyBvbiB0aHJlZSBmdW5jdGlvbnMgZnJvbSB0aGUgZHBseXIgcGFja2FnZSB0aGF0IGhlbHBzIGFjY29tcGxpc2ggdGhlc2UgdGFza3MuIFRoZSBmaXJzdCB0d28gZnVuY3Rpb25zLCBiaW5kX3Jvd3MgYW5kIGJpbmRfY29scywgYXJlIHVzZWQgZm9yIHN0YWNraW5nIGRhdGEgdG9nZXRoZXIuDQoNClRoZSB0aGlyZCBmdW5jdGlvbiwgYXJyYW5nZSwgaXMgdXNlZCBmb3Igc29ydGluZyBkYXRhLg0KDQojIyBQcmVsaW1pbmFyaWVzDQpMb2FkIHRoZSBkcGx5ciwgYW5kIG1hZ3JpdHRyIHBhY2thZ2VzLg0KYGBge3J9DQpsaWJyYXJ5KGRwbHlyKQ0KbGlicmFyeShtYWdyaXR0cikNCmBgYA0KTWFrZSBzdXJlIHRoYXQgdGhpcyBmaWxlIGFuZCB0aGUgamFuMTdJdGVtcy5jc3YgZmlsZSBhcmUgaW4gdGhlIHNhbWUgZm9sZGVyIGFuZCB0aGF0IHRoZSB3b3JraW5nIGRpcmVjdG9yeSBpcyBzZXQgdG8gdGhhdCBmb2xkZXIuDQoNClJlYWQgaW4gdGhlIGphbjE3SXRlbXMsIGZlYjE3SXRlbXMsIGFuZCBtYXIxN0l0ZW1zLmNzdiBkYXRhLg0KYGBge3J9DQpqMTdpIDwtIHJlYWQuY3N2KCdqYW4xN0l0ZW1zLmNzdicpDQpmMTdpIDwtIHJlYWQuY3N2KCdmZWIxN0l0ZW1zLmNzdicpDQptMTdpIDwtIHJlYWQuY3N2KCdtYXIxN0l0ZW1zLmNzdicpDQpgYGANCg0KIyMgU3RhY2tpbmcgUm93cyBvZiBEYXRhIFVzaW5nIGRwbHlyJ3MgYmluZF9yb3dzIEZ1bmN0aW9uDQpOb3cgdGhhdCB3ZSd2ZSByZWFkIGluIHRoZSBwb2ludC1vZi1zYWxlIGRhdGEgZm9yIHRocmVlIHNlcGFyYXRlIG1vbnRocywgbGV0J3Mgc3RhY2sgdGhlIHJvd3MgdG9nZXRoZXIgdXNpbmcgdGhlIGJpbmRfcm93cyBkcGx5ciBmdW5jdGlvbi4gTGV0J3MgcmV2aWV3IGEgY291cGxlIG9mIGtleSBwb2ludHMgZnJvbSB0aGUgaGVscCBkb2N1bWVudGF0aW9uIGZvciB0aGF0IGZ1bmN0aW9uIGJ5IHJ1bm5pbmcgYD9iaW5kX3Jvd3NgIGluIHRoZSBDb25zb2xlLg0KDQpOb3RpY2UgdGhhdCB0aGUgZmlyc3QgYXJndW1lbnQgaXMgZWxsaXBzZXMsIGluZGljYXRpbmcgdGhhdCB5b3UgY2FuIGNvbWJpbmUgdHdvIG9yIG1vcmUgZGF0YWZyYW1lcywgbGlzdHMsIG9yIGEgbGlzdCBvZiBkYXRhZnJhbWVzLiBJdCdzIG5pY2UgdGhhdCB3ZSdyZSBub3QgbGltaXRlZCB0byBjb21iaW5pbmcgdHdvIGRhdGFmcmFtZXMgYXQgYSB0aW1lLiBMZXQncyB0cnkgaXQgb3V0Og0KDQpgYGB7cn0NCmFsbEl0ZW1zIDwtIGJpbmRfcm93cyhqMTdpLCBmMTdpLCBtMTdpKQ0KYGBgDQpXZSBjYW4gdmlzdWFsbHkgaW5zcGVjdCB0aGUgZGF0YWZyYW1lIGFuZCBjb25maXJtIHRoYXQgaXQgaGFzIG9ic2VydmF0aW9ucyBmcm9tIGVhY2ggbW9udGgsIGFuZCBpbiB0aGUgb3JkZXIgdGhhdCB3ZSBlbnRlcmVkIHRoZW0gaW50byB0aGUgZnVuY3Rpb24uIFdlIGNhbiBhbHNvIGFkZCB1cCB0aGUgcm93cyBmb3IgdGhlIHRocmVlIHNlcGFyYXRlIGRhdGFmcmFtZXMgYW5kIHZlcmlmeSB0aGF0IHRoZSBzdW0gaXMgZXF1YWwgdG8gdGhlIHJvd3Mgb2YgdGhlIG5ldyBhbGxJdGVtcyBkYXRhZnJhbWUgdGhhdCB3ZSBjcmVhdGVkLg0KDQpBIGdyZWF0IGFzcGVjdCBhYm91dCB0aGlzIGZ1bmN0aW9uIGlzIHRoYXQgeW91IGNhbiBoYXZlIGEgZGlmZmVyZW50IG51bWJlciBvZiBjb2x1bW5zLCBhbmQgdGhlIGNvbHVtbiBuYW1lcyBkb24ndCBoYXZlIHRvIG1hdGNoIHVwLiBXaGlsZSB0aGlzIGNhbiBiZSBhIHZlcnkgdXNlZnVsIGFwc2VjdCBhYm91dCB0aGlzIGZ1bmN0aW9uIGl0IGFsc28gbWVhbnMgdGhhdCB5b3UgaGF2ZSB0byBiZSB2ZXJ5IGNhcmVmdWwgYmVjYXVzZSBpZiB0aGUgY29sdW1uIG5hbWVzIGFyZW4ndCBzcGVsbGVkIGV4YWN0bHkgdGhlIHNhbWUgd2F5IHRoZW4geW91IG1heSB1bmludGVudGlvbmFsbHkgY3JlYXRlIG5ldyBjb2x1bW5zLg0KDQojIyBTdGFja2luZyBDb2x1bW5zIG9mIERhdGEgVXNpbmcgZHBseXIncyBiaW5kX2NvbHMgRnVuY3Rpb24NClRoZSBiaW5kX2NvbHMgZnVuY3Rpb24gd29ya3Mgc2ltaWxhcmx5IHRvIHRoZSBiaW5kX3Jvd3MgZnVuY3Rpb24uIFlvdSBtYXkgaGF2ZSBub3RpY2VkIHRoYXQgdGhlIGRvY3VtZW50YXRpb24gZm9yIHRoZSBiaW5kX2NvbHMgZnVuY3Rpb24gaXMgdGhlIHNhbWUgYXMgZm9yIHRoZSBiaW5kX3Jvd3MgZnVuY3Rpb24uIA0KDQpUbyBpbGx1c3RyYXRlIHRoZSBiaW5kX2NvbHMgZnVuY3Rpb24sIGxldCdzIGZpcnN0IHNlcGFyYXRlIHRoZSBhbGxJdGVtcyBkYXRhZnJhbWUgaW50byB0d28gdGhyZWUgc2VwYXJhdGUgZGF0YWZyYW1lcyBvZiBzZXZlbiBjb2x1bW5zIGVhY2guDQpgYGB7cn0NCmRmMSA8LSBhbGxJdGVtc1ssMTo3XQ0KZGYyIDwtIGFsbEl0ZW1zWyw4OjE0XQ0KZGYzIDwtIGFsbEl0ZW1zWywxNToyMV0NCmBgYA0KDQpOb3csIGxldCdzIGJpbmQgdGhlbSBiYWNrIHRvZ2V0aGVyLCBidXQgaW4gYSBkaWZmZXJlbnQgb3JkZXIuDQpgYGB7cn0NCmFsbEl0ZW1zMiA8LSBiaW5kX2NvbHMoZGYxLCBkZjMsIGRmMikNCmBgYA0KQ29tcGFyaW5nIHRoZSBudW1iZXIgb2Ygcm93cyBhbmQgY29sdW1ucyBpbiB0aGUgYWxsSXRlbXMyIGRhdGFmcmFtZSB0byB0aGUgYWxsSXRlbXMgZGF0YWZyYW1lIGNvbmZpcm1zIHRoYXQgd2UgaGF2ZSBhIGRhdGFmcmFtZSB0aGF0IGlzIHRoZSBzYW1lIHNoYXBlLiBBIHZpc3VhbCBpbnNwZWN0aW9uIGNvbmZpcm1zIHRoYXQgdGhlIG9yZGVyIG9mIHRoZSBjb2x1bW5zIGlzIGRpZmZlcmVudC4NCg0KV2hhdCBoYXBwZW5zIGlmIHlvdSB1c2UgdGhlIGJpbmRfY29scyB0byBzdGFjayBjb2x1bW5zIHRoYXQgaGF2ZSBhIGRpZmZlcmVudCBudW1iZXIgb2Ygcm93cz8gSXQgd2lsbCB0aHJvdyBhbiBlcnJvciB1bmxlc3MgdGhlcmUncyBvbmx5IG9uZSByb3cgaW4gd2hpY2ggY2FzZSBpdCB3aWxsIGZpbGwgaW4gZXZlcnkgb3RoZXIgcm93IHdpdGggdGhlIHZhbHVlcyBmcm9tIHRoYXQgb25lIHJvdy4NCg0KSXQncyBpbXBvcnRhbnQgdG8gbm90ZSB0aGF0IHRoZXJlIGFyZSBlYXNpZXIgd2F5cyB0byByZW9yZGVyIHRoZSBjb2x1bW5zLCB3aXRob3V0IGhhdmluZyB0byBzZXBhcmF0ZSB0aGUgY29sdW1ucyBhbmQgdGhlbiBwaWVjZSB0aGVtIGJhY2sgdG9nZXRoZXIgYWdhaW4uDQojIyBTb3J0aW5nIHRoZSBEYXRhIFVzaW5nIGRwbHlyJ3MgYXJyYW5nZSBGdW5jdGlvbg0KQSByZWxhdGVkIHByZXByb2Nlc3NpbmcgdGFzayBpcyBvcmRlcmluZyB0aGUgZGF0YSBiYXNlZCBvbiB2YWx1ZXMgaW4gYSBzcGVjaWZpYyBjb2x1bW4uIFRoaXMgaXMgb2Z0ZW4gZGVzaXJhYmxlIGFmdGVyIGpvaW5pbmcgdHdvIG9yIG1vcmUgZGF0YWZyYW1lcyB0b2dldGhlci4gTGV0J3MgYXNzdW1lIHRoYXQgb25jZSB3ZSBjcmVhdGUgdGhlIGFsbEl0ZW1zIGRhdGFmcmFtZSB3ZSB3YW50IHRvIHNvcnQgdGhlIGRhdGEgYnkgY2FyZGhvbGRlciBuYW1lLCBhbmQgdGhlbiBieSBkYXRlLiBIZXJlJ3MgaG93IHdlIGNhbiBlYXNpbHkgYWNjb21wbGlzaCB0aGF0IHVzaW5nIHRoZSBhcnJhbmdlIGZ1bmN0aW9uOg0KYGBge3J9DQphbGxJdGVtcyA8LSBhbGxJdGVtcyAlPiUNCiAgYXJyYW5nZShDYXJkaG9sZGVyTmFtZSwgVGltZSkNCmBgYA0KQSB2aXN1YWwgaW5zcGVjdGlvbiBvZiB0aGUgZGF0YSBjb25maXJtcyB0aGF0IHRoZSBkYXRhIGlzIHNvcnRlZCBmaXJzdCBieSBDYXJkaG9sZGVyTmFtZSBhbmQgdGhlbiBieSBUaW1lLiBJZiB5b3Ugd2FudCB0byBzb3J0IGluIGRlc2NlbmRpbmcgb3JkZXIsIHRoZW4geW91IGNhbiB1c2UgdGhlIGBkZXNjYCBmdW5jdGlvbiB0byBpbmRpY2F0ZSB0aGF0J3Mgd2hhdCB5b3Ugd2FudCB0byBkby4NCg0KT25lIG90aGVyIHVzZWZ1bCBmdW5jdGlvbiBpcyB0aGUgdHdvLXdheSBwaXBlIG9wZXJhdG9yLCBgJTw+JWAsIHdoaWNoIHNhdmVzIHlvdSBmcm9tIHJldHlwaW5nIGEgZGF0YWZyYW1lJ3MgbmFtZSBpZiB5b3UncmUgZ29pbmcgdG8gcmVhc3NpZ24gaXQgdG8gdGhlIHNhbWUgbmFtZS4NCg0KYGBge3J9DQphbGxJdGVtcyAlPD4lDQogIGFycmFuZ2UoZGVzYyhDYXJkaG9sZGVyTmFtZSksIFRpbWUpDQpgYGANCiMjIENvbmNsdWRpbmcgQ29tbWVudHMNCkluIGNvbmNsdXNpb24sIHRoZXNlIGZ1bmN0aW9ucyB0byBiaW5kIGFuZCBzb3J0IGRhdGFzZXRzIGFyZSByZWFsbHkgaGFuZHkuIEkgZmluZCB0aGF0IEkgdXNlIHRoZSBiaW5kX2NvbHMgZnVuY3Rpb24gbXVjaCBsZXNzIGZyZXF1ZW50bHkgdGhhbiBJIHVzZSB0aGUgYmluZF9yb3dzIGZ1bmN0aW9uIGJlY2F1c2UgSSBvZnRlbiB3YW50IHRvIGNyZWF0ZSBjb2x1bW5zIGZyb20gdHdvIHRvdGFsbHkgZGlmZmVyZW50IGRhdGFmcmFtZXMsIGxpa2UgcG9pbnQgb2Ygc2FsZSBkYXRhIGFuZCB3ZWF0aGVyIGRhdGEuIFdoZW4gdGhpcyBpcyB0aGUgY2FzZSwgSSB1c2UgYSBqb2luIGZ1bmN0aW9uIHRvIG1ha2Ugc3VyZSB0aGF0IEknbSBjb21iaW5pbmcgdmFsdWVzIHRoYXQgc2hvdWxkIGdvIHRvZ2V0aGVyLiBUaGF0J3MgYSB0b3BpYyBmb3IgYW5vdGhlciBsZXNzb24sIHRob3VnaC4NCg0KSSBob3BlIHRoZXNlIHNpbXBsZSBleGFtcGxlcyBpbGx1c3RyYXRlIHRoZSBrZXkgYXNwZWN0cyBvZiBob3cgdGhleSB3b3JrLiBBcyB5b3UgYXBwbHkgdGhlbSB5b3UnbGwgZmluZCB0aGF0IHRoZXJlIGFyZSBvcHRpb25hbCBwYXJhbWV0ZXJzIHRoYXQgeW91IG1pZ2h0IHdhbnQgdG8gY29uc2lkZXIgdXNpbmcsIHN1Y2ggYXMgY3JlYXRpbmcgYSBuZXcgY29sdW1uIHRvIGtlZXAgdHJhY2sgb2YgdGhlIGlkIG9mIGVhY2ggZGF0YWZyYW1lIHRoYXQgeW91IHN0YWNrLg0KDQo=