This lesson introduces two functions from the dplyr package for aggregating data: the group_by() function and the summarise() function.

We will also review how to use the lubridate package for converting strings to datetime types, as well as for rounding datetime values to date values.

Finally, we will introduce the n_distinct() function for calculating the distinct number of values for different groups.

Preliminaries

Load the dplyr, magrittr, and lubridate packages.

library(dplyr)
library(magrittr)
library(lubridate)

Make sure that this file and the jan17Items.csv file are in the same folder and that the working directory is set to that folder.

Read in the jan17Items data as j17i.

j17i <- read.csv('jan17Items.csv')

Aggregating Using dplyr’s summarise and group_by Functions

In the j17i dataframe, every row is a lineitem for a single transaction. Let’s assume that we want to analyze daily sales, and therefore we want to aggregate the data at the date level. Here are the five steps that we will take to do that:

  1. First we will convert the Time column to a datetime format.
  2. Then we will create a column that represents the date without the time.
  3. We will then group the observations by date.
  4. At this point we will indicate what summary values to calculate.
  5. Finally we will ungroup the data.
daily <- j17i %>%
  mutate(
    Time = ymd_hms(Time)
    , date = round_date(Time, 'day')
  ) %>%
  group_by(date) %>%
  summarise(avgCost = mean(Cost, na.rm = T)
            , maxPrice = max(Price, na.rm = T)
            , transactionQuantity = n_distinct(TransactionNumber)) %>%
  ungroup()

Remember to ungroup the data when you’re done with grouping calculations or else you can get errors. Notice that the resulting structure of the daily dataframe object has 30 rows and 4 columns. It retains the grouping column, as well as the three newly computed columns. You can visually explore the data to find out which dates are missing (the first two days in January) and which dates are unexpectedly in the data (February 1).

Concluding Comments

At the risk of being repetitive, I think it’s worth pointing out how easy it is to read the code due to the tidyverse grammar, which reduces the amount of time that you have to type out the dataframe’s name.

While this is a common use case for using the group_by and summarise functions, it’s just a starting point. There are many other ways in which these functions will come in handy. For example, the group_by function can also be used along with the mutate function to add a column that indicates the number of times a cardholder made a purchase. Hopefully this short lesson spurs some insight into other ways that these functions can be used, and what you can do with data.

LS0tDQp0aXRsZTogIkRhdGEgQWdncmVnYXRpb24gYW5kIFN1bW1hcnkiDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCi0tLQ0KVGhpcyBsZXNzb24gaW50cm9kdWNlcyB0d28gZnVuY3Rpb25zIGZyb20gdGhlIGRwbHlyIHBhY2thZ2UgZm9yIGFnZ3JlZ2F0aW5nIGRhdGE6IHRoZSBgZ3JvdXBfYnkoKWAgZnVuY3Rpb24gYW5kIHRoZSBgc3VtbWFyaXNlKClgIGZ1bmN0aW9uLiANCg0KV2Ugd2lsbCBhbHNvIHJldmlldyBob3cgdG8gdXNlIHRoZSBsdWJyaWRhdGUgcGFja2FnZSBmb3IgY29udmVydGluZyBzdHJpbmdzIHRvIGRhdGV0aW1lIHR5cGVzLCBhcyB3ZWxsIGFzIGZvciByb3VuZGluZyBkYXRldGltZSB2YWx1ZXMgdG8gZGF0ZSB2YWx1ZXMuDQoNCkZpbmFsbHksIHdlIHdpbGwgaW50cm9kdWNlIHRoZSBgbl9kaXN0aW5jdCgpYCBmdW5jdGlvbiBmb3IgY2FsY3VsYXRpbmcgdGhlIGRpc3RpbmN0IG51bWJlciBvZiB2YWx1ZXMgZm9yIGRpZmZlcmVudCBncm91cHMuDQoNCiMjIFByZWxpbWluYXJpZXMNCkxvYWQgdGhlIGRwbHlyLCBtYWdyaXR0ciwgYW5kIGx1YnJpZGF0ZSBwYWNrYWdlcy4NCmBgYHtyfQ0KbGlicmFyeShkcGx5cikNCmxpYnJhcnkobWFncml0dHIpDQpsaWJyYXJ5KGx1YnJpZGF0ZSkNCmBgYA0KTWFrZSBzdXJlIHRoYXQgdGhpcyBmaWxlIGFuZCB0aGUgamFuMTdJdGVtcy5jc3YgZmlsZSBhcmUgaW4gdGhlIHNhbWUgZm9sZGVyIGFuZCB0aGF0IHRoZSB3b3JraW5nIGRpcmVjdG9yeSBpcyBzZXQgdG8gdGhhdCBmb2xkZXIuDQoNClJlYWQgaW4gdGhlIGphbjE3SXRlbXMgZGF0YSBhcyBqMTdpLg0KYGBge3J9DQpqMTdpIDwtIHJlYWQuY3N2KCdqYW4xN0l0ZW1zLmNzdicpDQpgYGANCg0KIyMgQWdncmVnYXRpbmcgVXNpbmcgZHBseXIncyBzdW1tYXJpc2UgYW5kIGdyb3VwX2J5IEZ1bmN0aW9ucw0KSW4gdGhlIGoxN2kgZGF0YWZyYW1lLCBldmVyeSByb3cgaXMgYSBsaW5laXRlbSBmb3IgYSBzaW5nbGUgdHJhbnNhY3Rpb24uIExldCdzIGFzc3VtZSB0aGF0IHdlIHdhbnQgdG8gYW5hbHl6ZSBkYWlseSBzYWxlcywgYW5kIHRoZXJlZm9yZSB3ZSB3YW50IHRvIGFnZ3JlZ2F0ZSB0aGUgZGF0YSBhdCB0aGUgZGF0ZSBsZXZlbC4gSGVyZSBhcmUgdGhlIGZpdmUgc3RlcHMgdGhhdCB3ZSB3aWxsIHRha2UgdG8gZG8gdGhhdDoNCg0KMS4gRmlyc3Qgd2Ugd2lsbCBjb252ZXJ0IHRoZSBUaW1lIGNvbHVtbiB0byBhIGRhdGV0aW1lIGZvcm1hdC4NCjIuIFRoZW4gd2Ugd2lsbCBjcmVhdGUgYSBjb2x1bW4gdGhhdCByZXByZXNlbnRzIHRoZSBkYXRlIHdpdGhvdXQgdGhlIHRpbWUuDQozLiBXZSB3aWxsIHRoZW4gZ3JvdXAgdGhlIG9ic2VydmF0aW9ucyBieSBkYXRlLg0KNC4gQXQgdGhpcyBwb2ludCB3ZSB3aWxsIGluZGljYXRlIHdoYXQgc3VtbWFyeSB2YWx1ZXMgdG8gY2FsY3VsYXRlLg0KNS4gRmluYWxseSB3ZSB3aWxsIHVuZ3JvdXAgdGhlIGRhdGEuDQoNCmBgYHtyfQ0KZGFpbHkgPC0gajE3aSAlPiUNCiAgbXV0YXRlKA0KICAgIFRpbWUgPSB5bWRfaG1zKFRpbWUpDQogICAgLCBkYXRlID0gcm91bmRfZGF0ZShUaW1lLCAnZGF5JykNCiAgKSAlPiUNCiAgZ3JvdXBfYnkoZGF0ZSkgJT4lDQogIHN1bW1hcmlzZShhdmdDb3N0ID0gbWVhbihDb3N0LCBuYS5ybSA9IFQpDQogICAgICAgICAgICAsIG1heFByaWNlID0gbWF4KFByaWNlLCBuYS5ybSA9IFQpDQogICAgICAgICAgICAsIHRyYW5zYWN0aW9uUXVhbnRpdHkgPSBuX2Rpc3RpbmN0KFRyYW5zYWN0aW9uTnVtYmVyKSkgJT4lDQogIHVuZ3JvdXAoKQ0KYGBgDQoNClJlbWVtYmVyIHRvIHVuZ3JvdXAgdGhlIGRhdGEgd2hlbiB5b3UncmUgZG9uZSB3aXRoIGdyb3VwaW5nIGNhbGN1bGF0aW9ucyBvciBlbHNlIHlvdSBjYW4gZ2V0IGVycm9ycy4gTm90aWNlIHRoYXQgdGhlIHJlc3VsdGluZyBzdHJ1Y3R1cmUgb2YgdGhlIGRhaWx5IGRhdGFmcmFtZSBvYmplY3QgaGFzIDMwIHJvd3MgYW5kIDQgY29sdW1ucy4gSXQgcmV0YWlucyB0aGUgZ3JvdXBpbmcgY29sdW1uLCBhcyB3ZWxsIGFzIHRoZSB0aHJlZSBuZXdseSBjb21wdXRlZCBjb2x1bW5zLiBZb3UgY2FuIHZpc3VhbGx5IGV4cGxvcmUgdGhlIGRhdGEgdG8gZmluZCBvdXQgd2hpY2ggZGF0ZXMgYXJlIG1pc3NpbmcgKHRoZSBmaXJzdCB0d28gZGF5cyBpbiBKYW51YXJ5KSBhbmQgd2hpY2ggZGF0ZXMgYXJlIHVuZXhwZWN0ZWRseSBpbiB0aGUgZGF0YSAoRmVicnVhcnkgMSkuDQoNCiMjIENvbmNsdWRpbmcgQ29tbWVudHMNCkF0IHRoZSByaXNrIG9mIGJlaW5nIHJlcGV0aXRpdmUsIEkgdGhpbmsgaXQncyB3b3J0aCBwb2ludGluZyBvdXQgaG93IGVhc3kgaXQgaXMgdG8gcmVhZCB0aGUgY29kZSBkdWUgdG8gdGhlIHRpZHl2ZXJzZSBncmFtbWFyLCB3aGljaCByZWR1Y2VzIHRoZSBhbW91bnQgb2YgdGltZSB0aGF0IHlvdSBoYXZlIHRvIHR5cGUgb3V0IHRoZSBkYXRhZnJhbWUncyBuYW1lLg0KDQpXaGlsZSB0aGlzIGlzIGEgY29tbW9uIHVzZSBjYXNlIGZvciB1c2luZyB0aGUgZ3JvdXBfYnkgYW5kIHN1bW1hcmlzZSBmdW5jdGlvbnMsIGl0J3MganVzdCBhIHN0YXJ0aW5nIHBvaW50LiBUaGVyZSBhcmUgbWFueSBvdGhlciB3YXlzIGluIHdoaWNoIHRoZXNlIGZ1bmN0aW9ucyB3aWxsIGNvbWUgaW4gaGFuZHkuIEZvciBleGFtcGxlLCB0aGUgZ3JvdXBfYnkgZnVuY3Rpb24gY2FuIGFsc28gYmUgdXNlZCBhbG9uZyB3aXRoIHRoZSBtdXRhdGUgZnVuY3Rpb24gdG8gYWRkIGEgY29sdW1uIHRoYXQgaW5kaWNhdGVzIHRoZSBudW1iZXIgb2YgdGltZXMgYSBjYXJkaG9sZGVyIG1hZGUgYSBwdXJjaGFzZS4gSG9wZWZ1bGx5IHRoaXMgc2hvcnQgbGVzc29uIHNwdXJzIHNvbWUgaW5zaWdodCBpbnRvIG90aGVyIHdheXMgdGhhdCB0aGVzZSBmdW5jdGlvbnMgY2FuIGJlIHVzZWQsIGFuZCB3aGF0IHlvdSBjYW4gZG8gd2l0aCBkYXRhLg0KDQoNCg0KDQo=