Some useful insights can be gained when one dataset is analyzed in
the context of another dataset. For instance, if weather is expected to
have an influence on sales, then it may be worth combining the weather
measurements to the point-of-sale data.
Combining two datasets together in this way is typically done with a
join. A join always results in a wider dataframe. Sometimes they can be
used to create a longer dataframe, but that is less common. There are
several types of joins. The one that I tend to use most often is a left
join. A left join means that you keep all of the observations in the
dataframe on the left. Observations on the right are only kept if they
have a matching value in the dataframe on the left.
There are also right joins, inner joins, and full joins. A right join
means that you keep all of the observations in the dataframe on the
right. Only observations on the left are kept if they have a matching
value in the dataframe on the right.
Inner joins only keep observations for which there is a matching
value in both dataframes.
Full joins keep all observations in both the right and left
dataframes even if there are not matching values.
In this lesson we’ll demonstrate each of these joins using the
following dplyr functions: left_join, right_join, inner_join,
full_join.
Preliminaries
Load the dplyr, magrittr, and lubridate packages.
library(dplyr)
library(magrittr)
library(lubridate)
Make sure that this file and the jan17Items.csv and jan17Weather.csv
files are in the same folder and that the working directory is set to
that folder.
Read in the jan17Items and jan17Weather data, and make sure to
convert the date and time columns to a date and time format.
j17i <- read.csv('jan17Items.csv') %>%
mutate(
Time = ymd_hms(Time)
)
j17w <- read.csv('jan17Weather.csv', sep = '\t') %>%
mutate(
date = ymd_hms(date)
)
Left Join on Aggregated Data
For a join to work, we have to identify a column on which to match
the data. This column is known as the primary key, and the values will
only be joined if they values in the primary key column are identical.
So one of the data wrangling tasks associated with joins is making sure
that the values in the primary key column of both datasets are in the
right format.
The obvious option for joining the weather data to the point-of-sale
data is to use the Time and date column. Notice that while the Time
column in the j17i data and the date column in the j17w data are both
the same data type, all of the date values are rounded to the beginning
hour, minute, and second of the day. There are probably few if any
observations at midnight in our dataset.
Therefore, let’s aggregate the point-of-sale data so that there’s one
observation for each day.
j17i %<>% mutate(date = round_date(Time, 'day'))
daily <- j17i %>%
group_by(date) %>%
summarise(avgCost = mean(Cost, na.rm = T)
, maxPrice = max(Price, na.rm = T)
, transactionQuantity = n_distinct(TransactionNumber)) %>%
ungroup()
Notice that this daily dataframe has 30 observations and four
variables. It does not include dates for January 1 and January 2, but
does include an observation for February. In contrast, the weather data
has an observation for each of the 31 days in January, and no
observations for February.
Let’s perform a left join with the point-of-sale data on the left and
the weather data on the right and see what we get:
dailyLeft <- daily %>%
left_join(j17w, by = 'date')
We now have a dailyLeft dataframe with 30 observations and 8
variables. A visual inspection confirms that the weather values are
there only for the dates in the j17i dataframe, which is the left one in
this case because that’s what gets entered into the left_join function
first by the pipe symbol. Because there wasn’t weather data for February
1, there are NA values for the weather data on that date.
Right Join on Aggregated Data
Now let’s keep almost everything the same, except that we’ll replace
the left_join function with the right_join function:
dailyRight <- daily %>%
right_join(j17w, by = 'date')
The dailyRight dataframe has 31 observations because there is weather
data for every day in January, and no observations for data in February.
Notice that the rows for January 1 and Janary 2 have NA values for the
point-of-sale columns, and there is not a value for February 1.
Inner Join on Aggregated Data
Let’s compare the left and right join to an inner join.
dailyInner <- daily %>%
inner_join(j17w, by = 'date')
The dailyInner dataframe has 29 observations, which correspond to the
29 days in January that are in both the point-of-sale and the weather
data.
Full Join on Aggregated Data
Finally, let’s evaluate a full join:
dailyFull <- daily %>%
full_join(j17w, by = 'date')
The dailyFull dataframe keeps all observations in both dataframes.
Naturally, it has the most observations. The 32 observations correspond
to the 31 days in January plus the first day of February. Notice that
there are NA values in the point of sale columns for January 1 and 2,
and NA values in the weather columns for February 1.
LS0tDQp0aXRsZTogIkpvaW5pbmcgRGF0YSINCm91dHB1dDogaHRtbF9ub3RlYm9vaw0KLS0tDQpTb21lIHVzZWZ1bCBpbnNpZ2h0cyBjYW4gYmUgZ2FpbmVkIHdoZW4gb25lIGRhdGFzZXQgaXMgYW5hbHl6ZWQgaW4gdGhlIGNvbnRleHQgb2YgYW5vdGhlciBkYXRhc2V0LiBGb3IgaW5zdGFuY2UsIGlmIHdlYXRoZXIgaXMgZXhwZWN0ZWQgdG8gaGF2ZSBhbiBpbmZsdWVuY2Ugb24gc2FsZXMsIHRoZW4gaXQgbWF5IGJlIHdvcnRoIGNvbWJpbmluZyB0aGUgd2VhdGhlciBtZWFzdXJlbWVudHMgdG8gdGhlIHBvaW50LW9mLXNhbGUgZGF0YS4NCg0KQ29tYmluaW5nIHR3byBkYXRhc2V0cyB0b2dldGhlciBpbiB0aGlzIHdheSBpcyB0eXBpY2FsbHkgZG9uZSB3aXRoIGEgam9pbi4gQSBqb2luIGFsd2F5cyByZXN1bHRzIGluIGEgd2lkZXIgZGF0YWZyYW1lLiBTb21ldGltZXMgdGhleSBjYW4gYmUgdXNlZCB0byBjcmVhdGUgYSBsb25nZXIgZGF0YWZyYW1lLCBidXQgdGhhdCBpcyBsZXNzIGNvbW1vbi4gVGhlcmUgYXJlIHNldmVyYWwgdHlwZXMgb2Ygam9pbnMuIFRoZSBvbmUgdGhhdCBJIHRlbmQgdG8gdXNlIG1vc3Qgb2Z0ZW4gaXMgYSBsZWZ0IGpvaW4uIEEgbGVmdCBqb2luIG1lYW5zIHRoYXQgeW91IGtlZXAgYWxsIG9mIHRoZSBvYnNlcnZhdGlvbnMgaW4gdGhlIGRhdGFmcmFtZSBvbiB0aGUgbGVmdC4gT2JzZXJ2YXRpb25zIG9uIHRoZSByaWdodCBhcmUgb25seSBrZXB0IGlmIHRoZXkgaGF2ZSBhIG1hdGNoaW5nIHZhbHVlIGluIHRoZSBkYXRhZnJhbWUgb24gdGhlIGxlZnQuDQoNClRoZXJlIGFyZSBhbHNvIHJpZ2h0IGpvaW5zLCBpbm5lciBqb2lucywgYW5kIGZ1bGwgam9pbnMuIEEgcmlnaHQgam9pbiBtZWFucyB0aGF0IHlvdSBrZWVwIGFsbCBvZiB0aGUgb2JzZXJ2YXRpb25zIGluIHRoZSBkYXRhZnJhbWUgb24gdGhlIHJpZ2h0LiBPbmx5IG9ic2VydmF0aW9ucyBvbiB0aGUgbGVmdCBhcmUga2VwdCBpZiB0aGV5IGhhdmUgYSBtYXRjaGluZyB2YWx1ZSBpbiB0aGUgZGF0YWZyYW1lIG9uIHRoZSByaWdodC4NCg0KSW5uZXIgam9pbnMgb25seSBrZWVwIG9ic2VydmF0aW9ucyBmb3Igd2hpY2ggdGhlcmUgaXMgYSBtYXRjaGluZyB2YWx1ZSBpbiBib3RoIGRhdGFmcmFtZXMuDQoNCkZ1bGwgam9pbnMga2VlcCBhbGwgb2JzZXJ2YXRpb25zIGluIGJvdGggdGhlIHJpZ2h0IGFuZCBsZWZ0IGRhdGFmcmFtZXMgZXZlbiBpZiB0aGVyZSBhcmUgbm90IG1hdGNoaW5nIHZhbHVlcy4NCg0KSW4gdGhpcyBsZXNzb24gd2UnbGwgZGVtb25zdHJhdGUgZWFjaCBvZiB0aGVzZSBqb2lucyB1c2luZyB0aGUgZm9sbG93aW5nIGRwbHlyIGZ1bmN0aW9uczogbGVmdF9qb2luLCByaWdodF9qb2luLCBpbm5lcl9qb2luLCBmdWxsX2pvaW4uDQoNCiMjIFByZWxpbWluYXJpZXMNCkxvYWQgdGhlIGRwbHlyLCAgbWFncml0dHIsIGFuZCBsdWJyaWRhdGUgcGFja2FnZXMuDQpgYGB7cn0NCmxpYnJhcnkoZHBseXIpDQpsaWJyYXJ5KG1hZ3JpdHRyKQ0KbGlicmFyeShsdWJyaWRhdGUpDQpgYGANCk1ha2Ugc3VyZSB0aGF0IHRoaXMgZmlsZSBhbmQgdGhlIGphbjE3SXRlbXMuY3N2IGFuZCBqYW4xN1dlYXRoZXIuY3N2IGZpbGVzIGFyZSBpbiB0aGUgc2FtZSBmb2xkZXIgYW5kIHRoYXQgdGhlIHdvcmtpbmcgZGlyZWN0b3J5IGlzIHNldCB0byB0aGF0IGZvbGRlci4NCg0KUmVhZCBpbiB0aGUgamFuMTdJdGVtcyBhbmQgamFuMTdXZWF0aGVyIGRhdGEsIGFuZCBtYWtlIHN1cmUgdG8gY29udmVydCB0aGUgZGF0ZSBhbmQgdGltZSBjb2x1bW5zIHRvIGEgZGF0ZSBhbmQgdGltZSBmb3JtYXQuDQpgYGB7cn0NCmoxN2kgPC0gcmVhZC5jc3YoJ2phbjE3SXRlbXMuY3N2JykgJT4lDQogIG11dGF0ZSgNCiAgICBUaW1lID0geW1kX2htcyhUaW1lKQ0KICApDQpqMTd3IDwtIHJlYWQuY3N2KCdqYW4xN1dlYXRoZXIuY3N2Jywgc2VwID0gJ1x0JykgJT4lDQogIG11dGF0ZSgNCiAgICBkYXRlID0geW1kX2htcyhkYXRlKQ0KICApDQpgYGANCg0KIyMgTGVmdCBKb2luIG9uIEFnZ3JlZ2F0ZWQgRGF0YQ0KRm9yIGEgam9pbiB0byB3b3JrLCB3ZSBoYXZlIHRvIGlkZW50aWZ5IGEgY29sdW1uIG9uIHdoaWNoIHRvIG1hdGNoIHRoZSBkYXRhLiBUaGlzIGNvbHVtbiBpcyBrbm93biBhcyB0aGUgcHJpbWFyeSBrZXksIGFuZCB0aGUgdmFsdWVzIHdpbGwgb25seSBiZSBqb2luZWQgaWYgdGhleSB2YWx1ZXMgaW4gdGhlIHByaW1hcnkga2V5IGNvbHVtbiBhcmUgaWRlbnRpY2FsLiBTbyBvbmUgb2YgdGhlIGRhdGEgd3JhbmdsaW5nIHRhc2tzIGFzc29jaWF0ZWQgd2l0aCBqb2lucyBpcyBtYWtpbmcgc3VyZSB0aGF0IHRoZSB2YWx1ZXMgaW4gdGhlIHByaW1hcnkga2V5IGNvbHVtbiBvZiBib3RoIGRhdGFzZXRzIGFyZSBpbiB0aGUgcmlnaHQgZm9ybWF0Lg0KDQpUaGUgb2J2aW91cyBvcHRpb24gZm9yIGpvaW5pbmcgdGhlIHdlYXRoZXIgZGF0YSB0byB0aGUgcG9pbnQtb2Ytc2FsZSBkYXRhIGlzIHRvIHVzZSB0aGUgVGltZSBhbmQgZGF0ZSBjb2x1bW4uIE5vdGljZSB0aGF0IHdoaWxlIHRoZSBUaW1lIGNvbHVtbiBpbiB0aGUgajE3aSBkYXRhIGFuZCB0aGUgZGF0ZSBjb2x1bW4gaW4gdGhlIGoxN3cgZGF0YSBhcmUgYm90aCB0aGUgc2FtZSBkYXRhIHR5cGUsIGFsbCBvZiB0aGUgZGF0ZSB2YWx1ZXMgYXJlIHJvdW5kZWQgdG8gdGhlIGJlZ2lubmluZyBob3VyLCBtaW51dGUsIGFuZCBzZWNvbmQgb2YgdGhlIGRheS4gVGhlcmUgYXJlIHByb2JhYmx5IGZldyBpZiBhbnkgb2JzZXJ2YXRpb25zIGF0IG1pZG5pZ2h0IGluIG91ciBkYXRhc2V0Lg0KDQpUaGVyZWZvcmUsIGxldCdzIGFnZ3JlZ2F0ZSB0aGUgcG9pbnQtb2Ytc2FsZSBkYXRhIHNvIHRoYXQgdGhlcmUncyBvbmUgb2JzZXJ2YXRpb24gZm9yIGVhY2ggZGF5Lg0KYGBge3J9DQpqMTdpICU8PiUgbXV0YXRlKGRhdGUgPSByb3VuZF9kYXRlKFRpbWUsICdkYXknKSkNCmRhaWx5IDwtIGoxN2kgJT4lDQogIGdyb3VwX2J5KGRhdGUpICU+JQ0KICBzdW1tYXJpc2UoYXZnQ29zdCA9IG1lYW4oQ29zdCwgbmEucm0gPSBUKQ0KICAgICAgICAgICAgLCBtYXhQcmljZSA9IG1heChQcmljZSwgbmEucm0gPSBUKQ0KICAgICAgICAgICAgLCB0cmFuc2FjdGlvblF1YW50aXR5ID0gbl9kaXN0aW5jdChUcmFuc2FjdGlvbk51bWJlcikpICU+JQ0KICB1bmdyb3VwKCkNCmBgYA0KTm90aWNlIHRoYXQgdGhpcyBkYWlseSBkYXRhZnJhbWUgaGFzIDMwIG9ic2VydmF0aW9ucyBhbmQgZm91ciB2YXJpYWJsZXMuIEl0IGRvZXMgbm90IGluY2x1ZGUgZGF0ZXMgZm9yIEphbnVhcnkgMSBhbmQgSmFudWFyeSAyLCBidXQgZG9lcyBpbmNsdWRlIGFuIG9ic2VydmF0aW9uIGZvciBGZWJydWFyeS4gSW4gY29udHJhc3QsIHRoZSB3ZWF0aGVyIGRhdGEgaGFzIGFuIG9ic2VydmF0aW9uIGZvciBlYWNoIG9mIHRoZSAzMSBkYXlzIGluIEphbnVhcnksIGFuZCBubyBvYnNlcnZhdGlvbnMgZm9yIEZlYnJ1YXJ5LiANCg0KTGV0J3MgcGVyZm9ybSBhIGxlZnQgam9pbiB3aXRoIHRoZSBwb2ludC1vZi1zYWxlIGRhdGEgb24gdGhlIGxlZnQgYW5kIHRoZSB3ZWF0aGVyIGRhdGEgb24gdGhlIHJpZ2h0IGFuZCBzZWUgd2hhdCB3ZSBnZXQ6DQpgYGB7cn0NCmRhaWx5TGVmdCA8LSBkYWlseSAlPiUNCiAgbGVmdF9qb2luKGoxN3csIGJ5ID0gJ2RhdGUnKQ0KYGBgDQoNCldlIG5vdyBoYXZlIGEgZGFpbHlMZWZ0IGRhdGFmcmFtZSB3aXRoIDMwIG9ic2VydmF0aW9ucyBhbmQgOCB2YXJpYWJsZXMuIEEgdmlzdWFsIGluc3BlY3Rpb24gY29uZmlybXMgdGhhdCB0aGUgd2VhdGhlciB2YWx1ZXMgYXJlIHRoZXJlIG9ubHkgZm9yIHRoZSBkYXRlcyBpbiB0aGUgajE3aSBkYXRhZnJhbWUsIHdoaWNoIGlzIHRoZSBsZWZ0IG9uZSBpbiB0aGlzIGNhc2UgYmVjYXVzZSB0aGF0J3Mgd2hhdCBnZXRzIGVudGVyZWQgaW50byB0aGUgbGVmdF9qb2luIGZ1bmN0aW9uIGZpcnN0IGJ5IHRoZSBwaXBlIHN5bWJvbC4gQmVjYXVzZSB0aGVyZSB3YXNuJ3Qgd2VhdGhlciBkYXRhIGZvciBGZWJydWFyeSAxLCB0aGVyZSBhcmUgTkEgdmFsdWVzIGZvciB0aGUgd2VhdGhlciBkYXRhIG9uIHRoYXQgZGF0ZS4NCg0KIyMgUmlnaHQgSm9pbiBvbiBBZ2dyZWdhdGVkIERhdGENCk5vdyBsZXQncyBrZWVwIGFsbW9zdCBldmVyeXRoaW5nIHRoZSBzYW1lLCBleGNlcHQgdGhhdCB3ZSdsbCByZXBsYWNlIHRoZSBsZWZ0X2pvaW4gZnVuY3Rpb24gd2l0aCB0aGUgcmlnaHRfam9pbiBmdW5jdGlvbjoNCmBgYHtyfQ0KZGFpbHlSaWdodCA8LSBkYWlseSAlPiUNCiAgcmlnaHRfam9pbihqMTd3LCBieSA9ICdkYXRlJykNCmBgYA0KVGhlIGRhaWx5UmlnaHQgZGF0YWZyYW1lIGhhcyAzMSBvYnNlcnZhdGlvbnMgYmVjYXVzZSB0aGVyZSBpcyB3ZWF0aGVyIGRhdGEgZm9yIGV2ZXJ5IGRheSBpbiBKYW51YXJ5LCBhbmQgbm8gb2JzZXJ2YXRpb25zIGZvciBkYXRhIGluIEZlYnJ1YXJ5LiBOb3RpY2UgdGhhdCB0aGUgcm93cyBmb3IgSmFudWFyeSAxIGFuZCBKYW5hcnkgMiBoYXZlIE5BIHZhbHVlcyBmb3IgdGhlIHBvaW50LW9mLXNhbGUgY29sdW1ucywgYW5kIHRoZXJlIGlzIG5vdCBhIHZhbHVlIGZvciBGZWJydWFyeSAxLg0KDQojIyBJbm5lciBKb2luIG9uIEFnZ3JlZ2F0ZWQgRGF0YQ0KTGV0J3MgY29tcGFyZSB0aGUgbGVmdCBhbmQgcmlnaHQgam9pbiB0byBhbiBpbm5lciBqb2luLg0KYGBge3J9DQpkYWlseUlubmVyIDwtIGRhaWx5ICU+JQ0KICBpbm5lcl9qb2luKGoxN3csIGJ5ID0gJ2RhdGUnKQ0KYGBgDQpUaGUgZGFpbHlJbm5lciBkYXRhZnJhbWUgaGFzIDI5IG9ic2VydmF0aW9ucywgd2hpY2ggY29ycmVzcG9uZCB0byB0aGUgMjkgZGF5cyBpbiBKYW51YXJ5IHRoYXQgYXJlIGluIGJvdGggdGhlIHBvaW50LW9mLXNhbGUgYW5kIHRoZSB3ZWF0aGVyIGRhdGEuDQoNCiMjIEZ1bGwgSm9pbiBvbiBBZ2dyZWdhdGVkIERhdGENCkZpbmFsbHksIGxldCdzIGV2YWx1YXRlIGEgZnVsbCBqb2luOg0KYGBge3J9DQpkYWlseUZ1bGwgPC0gZGFpbHkgJT4lDQogIGZ1bGxfam9pbihqMTd3LCBieSA9ICdkYXRlJykNCmBgYA0KVGhlIGRhaWx5RnVsbCBkYXRhZnJhbWUga2VlcHMgYWxsIG9ic2VydmF0aW9ucyBpbiBib3RoIGRhdGFmcmFtZXMuIE5hdHVyYWxseSwgaXQgaGFzIHRoZSBtb3N0IG9ic2VydmF0aW9ucy4gVGhlIDMyIG9ic2VydmF0aW9ucyBjb3JyZXNwb25kIHRvIHRoZSAzMSBkYXlzIGluIEphbnVhcnkgcGx1cyB0aGUgZmlyc3QgZGF5IG9mIEZlYnJ1YXJ5LiBOb3RpY2UgdGhhdCB0aGVyZSBhcmUgTkEgdmFsdWVzIGluIHRoZSBwb2ludCBvZiBzYWxlIGNvbHVtbnMgZm9yIEphbnVhcnkgMSBhbmQgMiwgYW5kIE5BIHZhbHVlcyBpbiB0aGUgd2VhdGhlciBjb2x1bW5zIGZvciBGZWJydWFyeSAxLg0KDQojIyBDb25jbHVkaW5nIENvbW1lbnRzDQpJbiBjb25jbHVzaW9uLCBqb2lucyBhcmUgcmVhbGx5IGVhc3kgdG8gZG8uIEFzIHlvdSB1c2UgdGhlbSBtb3JlLCB5b3UnbGwgZmluZCBvdGhlciBvcHRpb25zIGFuZCBvdGhlciB0eXBlcyBvZiBqb2lucy4gSSByZWNvbW1lbmQgdGhhdCB5b3UgdGhpbmsgY2FyZWZ1bGx5IGFib3V0IHdoYXQgdGhlIHNoYXBlIG9mIHRoZSBjb21iaW5lZCBkYXRhZnJhbWUgc2hvdWxkIGJlLiBJZiB0aGUgbnVtYmVyIG9mIHJvd3MgZ3Jvd3MgdW5leHBlY3RlZGx5LCB0aGVuIGl0IHByb2JhYmx5IG1lYW5zIHRoYXQgeW91IGhhdmUgZHVwbGljYXRlIHZhbHVlcyBpbiBvbmUgb2YgdGhlIGRhdGFmcmFtZXMuIFlvdSBlbmQgdXAgbGVhcm5pbmcgYSBsb3QgYWJvdXQgeW91ciBkYXRhIGFzIHlvdSBwcmVwYXJlIHRvIHBlcmZvcm0gam9pbnMuDQoNCg==