This lesson introduces two functions from the dplyr package for
aggregating data: the group_by() function and the
summarise() function.
We will also review how to use the lubridate package for converting
strings to datetime types, as well as for rounding datetime values to
date values.
Finally, we will introduce the n_distinct() function for
calculating the distinct number of values for different groups.
Preliminaries
Load the dplyr, magrittr, and lubridate packages.
library(dplyr)
library(magrittr)
library(lubridate)
Make sure that this file and the jan17Items.csv file are in the same
folder and that the working directory is set to that folder.
Read in the jan17Items data as j17i.
j17i <- read.csv('jan17Items.csv')
Aggregating Using dplyr’s summarise and group_by Functions
In the j17i dataframe, every row is a lineitem for a single
transaction. Let’s assume that we want to analyze daily sales, and
therefore we want to aggregate the data at the date level. Here are the
five steps that we will take to do that:
- First we will convert the Time column to a datetime format.
- Then we will create a column that represents the date without the
time.
- We will then group the observations by date.
- At this point we will indicate what summary values to
calculate.
- Finally we will ungroup the data.
daily <- j17i %>%
mutate(
Time = ymd_hms(Time)
, date = round_date(Time, 'day')
) %>%
group_by(date) %>%
summarise(avgCost = mean(Cost, na.rm = T)
, maxPrice = max(Price, na.rm = T)
, transactionQuantity = n_distinct(TransactionNumber)) %>%
ungroup()
Remember to ungroup the data when you’re done with grouping
calculations or else you can get errors. Notice that the resulting
structure of the daily dataframe object has 30 rows and 4 columns. It
retains the grouping column, as well as the three newly computed
columns. You can visually explore the data to find out which dates are
missing (the first two days in January) and which dates are unexpectedly
in the data (February 1).
LS0tDQp0aXRsZTogIkRhdGEgQWdncmVnYXRpb24gYW5kIFN1bW1hcnkiDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCi0tLQ0KVGhpcyBsZXNzb24gaW50cm9kdWNlcyB0d28gZnVuY3Rpb25zIGZyb20gdGhlIGRwbHlyIHBhY2thZ2UgZm9yIGFnZ3JlZ2F0aW5nIGRhdGE6IHRoZSBgZ3JvdXBfYnkoKWAgZnVuY3Rpb24gYW5kIHRoZSBgc3VtbWFyaXNlKClgIGZ1bmN0aW9uLiANCg0KV2Ugd2lsbCBhbHNvIHJldmlldyBob3cgdG8gdXNlIHRoZSBsdWJyaWRhdGUgcGFja2FnZSBmb3IgY29udmVydGluZyBzdHJpbmdzIHRvIGRhdGV0aW1lIHR5cGVzLCBhcyB3ZWxsIGFzIGZvciByb3VuZGluZyBkYXRldGltZSB2YWx1ZXMgdG8gZGF0ZSB2YWx1ZXMuDQoNCkZpbmFsbHksIHdlIHdpbGwgaW50cm9kdWNlIHRoZSBgbl9kaXN0aW5jdCgpYCBmdW5jdGlvbiBmb3IgY2FsY3VsYXRpbmcgdGhlIGRpc3RpbmN0IG51bWJlciBvZiB2YWx1ZXMgZm9yIGRpZmZlcmVudCBncm91cHMuDQoNCiMjIFByZWxpbWluYXJpZXMNCkxvYWQgdGhlIGRwbHlyLCBtYWdyaXR0ciwgYW5kIGx1YnJpZGF0ZSBwYWNrYWdlcy4NCmBgYHtyfQ0KbGlicmFyeShkcGx5cikNCmxpYnJhcnkobWFncml0dHIpDQpsaWJyYXJ5KGx1YnJpZGF0ZSkNCmBgYA0KTWFrZSBzdXJlIHRoYXQgdGhpcyBmaWxlIGFuZCB0aGUgamFuMTdJdGVtcy5jc3YgZmlsZSBhcmUgaW4gdGhlIHNhbWUgZm9sZGVyIGFuZCB0aGF0IHRoZSB3b3JraW5nIGRpcmVjdG9yeSBpcyBzZXQgdG8gdGhhdCBmb2xkZXIuDQoNClJlYWQgaW4gdGhlIGphbjE3SXRlbXMgZGF0YSBhcyBqMTdpLg0KYGBge3J9DQpqMTdpIDwtIHJlYWQuY3N2KCdqYW4xN0l0ZW1zLmNzdicpDQpgYGANCg0KIyMgQWdncmVnYXRpbmcgVXNpbmcgZHBseXIncyBzdW1tYXJpc2UgYW5kIGdyb3VwX2J5IEZ1bmN0aW9ucw0KSW4gdGhlIGoxN2kgZGF0YWZyYW1lLCBldmVyeSByb3cgaXMgYSBsaW5laXRlbSBmb3IgYSBzaW5nbGUgdHJhbnNhY3Rpb24uIExldCdzIGFzc3VtZSB0aGF0IHdlIHdhbnQgdG8gYW5hbHl6ZSBkYWlseSBzYWxlcywgYW5kIHRoZXJlZm9yZSB3ZSB3YW50IHRvIGFnZ3JlZ2F0ZSB0aGUgZGF0YSBhdCB0aGUgZGF0ZSBsZXZlbC4gSGVyZSBhcmUgdGhlIGZpdmUgc3RlcHMgdGhhdCB3ZSB3aWxsIHRha2UgdG8gZG8gdGhhdDoNCg0KMS4gRmlyc3Qgd2Ugd2lsbCBjb252ZXJ0IHRoZSBUaW1lIGNvbHVtbiB0byBhIGRhdGV0aW1lIGZvcm1hdC4NCjIuIFRoZW4gd2Ugd2lsbCBjcmVhdGUgYSBjb2x1bW4gdGhhdCByZXByZXNlbnRzIHRoZSBkYXRlIHdpdGhvdXQgdGhlIHRpbWUuDQozLiBXZSB3aWxsIHRoZW4gZ3JvdXAgdGhlIG9ic2VydmF0aW9ucyBieSBkYXRlLg0KNC4gQXQgdGhpcyBwb2ludCB3ZSB3aWxsIGluZGljYXRlIHdoYXQgc3VtbWFyeSB2YWx1ZXMgdG8gY2FsY3VsYXRlLg0KNS4gRmluYWxseSB3ZSB3aWxsIHVuZ3JvdXAgdGhlIGRhdGEuDQoNCmBgYHtyfQ0KZGFpbHkgPC0gajE3aSAlPiUNCiAgbXV0YXRlKA0KICAgIFRpbWUgPSB5bWRfaG1zKFRpbWUpDQogICAgLCBkYXRlID0gcm91bmRfZGF0ZShUaW1lLCAnZGF5JykNCiAgKSAlPiUNCiAgZ3JvdXBfYnkoZGF0ZSkgJT4lDQogIHN1bW1hcmlzZShhdmdDb3N0ID0gbWVhbihDb3N0LCBuYS5ybSA9IFQpDQogICAgICAgICAgICAsIG1heFByaWNlID0gbWF4KFByaWNlLCBuYS5ybSA9IFQpDQogICAgICAgICAgICAsIHRyYW5zYWN0aW9uUXVhbnRpdHkgPSBuX2Rpc3RpbmN0KFRyYW5zYWN0aW9uTnVtYmVyKSkgJT4lDQogIHVuZ3JvdXAoKQ0KYGBgDQoNClJlbWVtYmVyIHRvIHVuZ3JvdXAgdGhlIGRhdGEgd2hlbiB5b3UncmUgZG9uZSB3aXRoIGdyb3VwaW5nIGNhbGN1bGF0aW9ucyBvciBlbHNlIHlvdSBjYW4gZ2V0IGVycm9ycy4gTm90aWNlIHRoYXQgdGhlIHJlc3VsdGluZyBzdHJ1Y3R1cmUgb2YgdGhlIGRhaWx5IGRhdGFmcmFtZSBvYmplY3QgaGFzIDMwIHJvd3MgYW5kIDQgY29sdW1ucy4gSXQgcmV0YWlucyB0aGUgZ3JvdXBpbmcgY29sdW1uLCBhcyB3ZWxsIGFzIHRoZSB0aHJlZSBuZXdseSBjb21wdXRlZCBjb2x1bW5zLiBZb3UgY2FuIHZpc3VhbGx5IGV4cGxvcmUgdGhlIGRhdGEgdG8gZmluZCBvdXQgd2hpY2ggZGF0ZXMgYXJlIG1pc3NpbmcgKHRoZSBmaXJzdCB0d28gZGF5cyBpbiBKYW51YXJ5KSBhbmQgd2hpY2ggZGF0ZXMgYXJlIHVuZXhwZWN0ZWRseSBpbiB0aGUgZGF0YSAoRmVicnVhcnkgMSkuDQoNCiMjIENvbmNsdWRpbmcgQ29tbWVudHMNCkF0IHRoZSByaXNrIG9mIGJlaW5nIHJlcGV0aXRpdmUsIEkgdGhpbmsgaXQncyB3b3J0aCBwb2ludGluZyBvdXQgaG93IGVhc3kgaXQgaXMgdG8gcmVhZCB0aGUgY29kZSBkdWUgdG8gdGhlIHRpZHl2ZXJzZSBncmFtbWFyLCB3aGljaCByZWR1Y2VzIHRoZSBhbW91bnQgb2YgdGltZSB0aGF0IHlvdSBoYXZlIHRvIHR5cGUgb3V0IHRoZSBkYXRhZnJhbWUncyBuYW1lLg0KDQpXaGlsZSB0aGlzIGlzIGEgY29tbW9uIHVzZSBjYXNlIGZvciB1c2luZyB0aGUgZ3JvdXBfYnkgYW5kIHN1bW1hcmlzZSBmdW5jdGlvbnMsIGl0J3MganVzdCBhIHN0YXJ0aW5nIHBvaW50LiBUaGVyZSBhcmUgbWFueSBvdGhlciB3YXlzIGluIHdoaWNoIHRoZXNlIGZ1bmN0aW9ucyB3aWxsIGNvbWUgaW4gaGFuZHkuIEZvciBleGFtcGxlLCB0aGUgZ3JvdXBfYnkgZnVuY3Rpb24gY2FuIGFsc28gYmUgdXNlZCBhbG9uZyB3aXRoIHRoZSBtdXRhdGUgZnVuY3Rpb24gdG8gYWRkIGEgY29sdW1uIHRoYXQgaW5kaWNhdGVzIHRoZSBudW1iZXIgb2YgdGltZXMgYSBjYXJkaG9sZGVyIG1hZGUgYSBwdXJjaGFzZS4gSG9wZWZ1bGx5IHRoaXMgc2hvcnQgbGVzc29uIHNwdXJzIHNvbWUgaW5zaWdodCBpbnRvIG90aGVyIHdheXMgdGhhdCB0aGVzZSBmdW5jdGlvbnMgY2FuIGJlIHVzZWQsIGFuZCB3aGF0IHlvdSBjYW4gZG8gd2l0aCBkYXRhLg0KDQoNCg0KDQo=