Some useful insights can be gained when one dataset is analyzed in the context of another dataset. For instance, if weather is expected to have an influence on sales, then it may be worth combining the weather measurements to the point-of-sale data.

Combining two datasets together in this way is typically done with a join. A join always results in a wider dataframe. Sometimes they can be used to create a longer dataframe, but that is less common. There are several types of joins. The one that I tend to use most often is a left join. A left join means that you keep all of the observations in the dataframe on the left. Observations on the right are only kept if they have a matching value in the dataframe on the left.

There are also right joins, inner joins, and full joins. A right join means that you keep all of the observations in the dataframe on the right. Only observations on the left are kept if they have a matching value in the dataframe on the right.

Inner joins only keep observations for which there is a matching value in both dataframes.

Full joins keep all observations in both the right and left dataframes even if there are not matching values.

In this lesson we’ll demonstrate each of these joins using the following dplyr functions: left_join, right_join, inner_join, full_join.

Preliminaries

Load the dplyr, magrittr, and lubridate packages.

library(dplyr)
library(magrittr)
library(lubridate)

Make sure that this file and the jan17Items.csv and jan17Weather.csv files are in the same folder and that the working directory is set to that folder.

Read in the jan17Items and jan17Weather data, and make sure to convert the date and time columns to a date and time format.

j17i <- read.csv('jan17Items.csv') %>%
  mutate(
    Time = ymd_hms(Time)
  )
j17w <- read.csv('jan17Weather.csv', sep = '\t') %>%
  mutate(
    date = ymd_hms(date)
  )

Left Join on Aggregated Data

For a join to work, we have to identify a column on which to match the data. This column is known as the primary key, and the values will only be joined if they values in the primary key column are identical. So one of the data wrangling tasks associated with joins is making sure that the values in the primary key column of both datasets are in the right format.

The obvious option for joining the weather data to the point-of-sale data is to use the Time and date column. Notice that while the Time column in the j17i data and the date column in the j17w data are both the same data type, all of the date values are rounded to the beginning hour, minute, and second of the day. There are probably few if any observations at midnight in our dataset.

Therefore, let’s aggregate the point-of-sale data so that there’s one observation for each day.

j17i %<>% mutate(date = round_date(Time, 'day'))
daily <- j17i %>%
  group_by(date) %>%
  summarise(avgCost = mean(Cost, na.rm = T)
            , maxPrice = max(Price, na.rm = T)
            , transactionQuantity = n_distinct(TransactionNumber)) %>%
  ungroup()

Notice that this daily dataframe has 30 observations and four variables. It does not include dates for January 1 and January 2, but does include an observation for February. In contrast, the weather data has an observation for each of the 31 days in January, and no observations for February.

Let’s perform a left join with the point-of-sale data on the left and the weather data on the right and see what we get:

dailyLeft <- daily %>%
  left_join(j17w, by = 'date')

We now have a dailyLeft dataframe with 30 observations and 8 variables. A visual inspection confirms that the weather values are there only for the dates in the j17i dataframe, which is the left one in this case because that’s what gets entered into the left_join function first by the pipe symbol. Because there wasn’t weather data for February 1, there are NA values for the weather data on that date.

Right Join on Aggregated Data

Now let’s keep almost everything the same, except that we’ll replace the left_join function with the right_join function:

dailyRight <- daily %>%
  right_join(j17w, by = 'date')

The dailyRight dataframe has 31 observations because there is weather data for every day in January, and no observations for data in February. Notice that the rows for January 1 and Janary 2 have NA values for the point-of-sale columns, and there is not a value for February 1.

Inner Join on Aggregated Data

Let’s compare the left and right join to an inner join.

dailyInner <- daily %>%
  inner_join(j17w, by = 'date')

The dailyInner dataframe has 29 observations, which correspond to the 29 days in January that are in both the point-of-sale and the weather data.

Full Join on Aggregated Data

Finally, let’s evaluate a full join:

dailyFull <- daily %>%
  full_join(j17w, by = 'date')

The dailyFull dataframe keeps all observations in both dataframes. Naturally, it has the most observations. The 32 observations correspond to the 31 days in January plus the first day of February. Notice that there are NA values in the point of sale columns for January 1 and 2, and NA values in the weather columns for February 1.

Concluding Comments

In conclusion, joins are really easy to do. As you use them more, you’ll find other options and other types of joins. I recommend that you think carefully about what the shape of the combined dataframe should be. If the number of rows grows unexpectedly, then it probably means that you have duplicate values in one of the dataframes. You end up learning a lot about your data as you prepare to perform joins.

LS0tDQp0aXRsZTogIkpvaW5pbmcgRGF0YSINCm91dHB1dDogaHRtbF9ub3RlYm9vaw0KLS0tDQpTb21lIHVzZWZ1bCBpbnNpZ2h0cyBjYW4gYmUgZ2FpbmVkIHdoZW4gb25lIGRhdGFzZXQgaXMgYW5hbHl6ZWQgaW4gdGhlIGNvbnRleHQgb2YgYW5vdGhlciBkYXRhc2V0LiBGb3IgaW5zdGFuY2UsIGlmIHdlYXRoZXIgaXMgZXhwZWN0ZWQgdG8gaGF2ZSBhbiBpbmZsdWVuY2Ugb24gc2FsZXMsIHRoZW4gaXQgbWF5IGJlIHdvcnRoIGNvbWJpbmluZyB0aGUgd2VhdGhlciBtZWFzdXJlbWVudHMgdG8gdGhlIHBvaW50LW9mLXNhbGUgZGF0YS4NCg0KQ29tYmluaW5nIHR3byBkYXRhc2V0cyB0b2dldGhlciBpbiB0aGlzIHdheSBpcyB0eXBpY2FsbHkgZG9uZSB3aXRoIGEgam9pbi4gQSBqb2luIGFsd2F5cyByZXN1bHRzIGluIGEgd2lkZXIgZGF0YWZyYW1lLiBTb21ldGltZXMgdGhleSBjYW4gYmUgdXNlZCB0byBjcmVhdGUgYSBsb25nZXIgZGF0YWZyYW1lLCBidXQgdGhhdCBpcyBsZXNzIGNvbW1vbi4gVGhlcmUgYXJlIHNldmVyYWwgdHlwZXMgb2Ygam9pbnMuIFRoZSBvbmUgdGhhdCBJIHRlbmQgdG8gdXNlIG1vc3Qgb2Z0ZW4gaXMgYSBsZWZ0IGpvaW4uIEEgbGVmdCBqb2luIG1lYW5zIHRoYXQgeW91IGtlZXAgYWxsIG9mIHRoZSBvYnNlcnZhdGlvbnMgaW4gdGhlIGRhdGFmcmFtZSBvbiB0aGUgbGVmdC4gT2JzZXJ2YXRpb25zIG9uIHRoZSByaWdodCBhcmUgb25seSBrZXB0IGlmIHRoZXkgaGF2ZSBhIG1hdGNoaW5nIHZhbHVlIGluIHRoZSBkYXRhZnJhbWUgb24gdGhlIGxlZnQuDQoNClRoZXJlIGFyZSBhbHNvIHJpZ2h0IGpvaW5zLCBpbm5lciBqb2lucywgYW5kIGZ1bGwgam9pbnMuIEEgcmlnaHQgam9pbiBtZWFucyB0aGF0IHlvdSBrZWVwIGFsbCBvZiB0aGUgb2JzZXJ2YXRpb25zIGluIHRoZSBkYXRhZnJhbWUgb24gdGhlIHJpZ2h0LiBPbmx5IG9ic2VydmF0aW9ucyBvbiB0aGUgbGVmdCBhcmUga2VwdCBpZiB0aGV5IGhhdmUgYSBtYXRjaGluZyB2YWx1ZSBpbiB0aGUgZGF0YWZyYW1lIG9uIHRoZSByaWdodC4NCg0KSW5uZXIgam9pbnMgb25seSBrZWVwIG9ic2VydmF0aW9ucyBmb3Igd2hpY2ggdGhlcmUgaXMgYSBtYXRjaGluZyB2YWx1ZSBpbiBib3RoIGRhdGFmcmFtZXMuDQoNCkZ1bGwgam9pbnMga2VlcCBhbGwgb2JzZXJ2YXRpb25zIGluIGJvdGggdGhlIHJpZ2h0IGFuZCBsZWZ0IGRhdGFmcmFtZXMgZXZlbiBpZiB0aGVyZSBhcmUgbm90IG1hdGNoaW5nIHZhbHVlcy4NCg0KSW4gdGhpcyBsZXNzb24gd2UnbGwgZGVtb25zdHJhdGUgZWFjaCBvZiB0aGVzZSBqb2lucyB1c2luZyB0aGUgZm9sbG93aW5nIGRwbHlyIGZ1bmN0aW9uczogbGVmdF9qb2luLCByaWdodF9qb2luLCBpbm5lcl9qb2luLCBmdWxsX2pvaW4uDQoNCiMjIFByZWxpbWluYXJpZXMNCkxvYWQgdGhlIGRwbHlyLCAgbWFncml0dHIsIGFuZCBsdWJyaWRhdGUgcGFja2FnZXMuDQpgYGB7cn0NCmxpYnJhcnkoZHBseXIpDQpsaWJyYXJ5KG1hZ3JpdHRyKQ0KbGlicmFyeShsdWJyaWRhdGUpDQpgYGANCk1ha2Ugc3VyZSB0aGF0IHRoaXMgZmlsZSBhbmQgdGhlIGphbjE3SXRlbXMuY3N2IGFuZCBqYW4xN1dlYXRoZXIuY3N2IGZpbGVzIGFyZSBpbiB0aGUgc2FtZSBmb2xkZXIgYW5kIHRoYXQgdGhlIHdvcmtpbmcgZGlyZWN0b3J5IGlzIHNldCB0byB0aGF0IGZvbGRlci4NCg0KUmVhZCBpbiB0aGUgamFuMTdJdGVtcyBhbmQgamFuMTdXZWF0aGVyIGRhdGEsIGFuZCBtYWtlIHN1cmUgdG8gY29udmVydCB0aGUgZGF0ZSBhbmQgdGltZSBjb2x1bW5zIHRvIGEgZGF0ZSBhbmQgdGltZSBmb3JtYXQuDQpgYGB7cn0NCmoxN2kgPC0gcmVhZC5jc3YoJ2phbjE3SXRlbXMuY3N2JykgJT4lDQogIG11dGF0ZSgNCiAgICBUaW1lID0geW1kX2htcyhUaW1lKQ0KICApDQpqMTd3IDwtIHJlYWQuY3N2KCdqYW4xN1dlYXRoZXIuY3N2Jywgc2VwID0gJ1x0JykgJT4lDQogIG11dGF0ZSgNCiAgICBkYXRlID0geW1kX2htcyhkYXRlKQ0KICApDQpgYGANCg0KIyMgTGVmdCBKb2luIG9uIEFnZ3JlZ2F0ZWQgRGF0YQ0KRm9yIGEgam9pbiB0byB3b3JrLCB3ZSBoYXZlIHRvIGlkZW50aWZ5IGEgY29sdW1uIG9uIHdoaWNoIHRvIG1hdGNoIHRoZSBkYXRhLiBUaGlzIGNvbHVtbiBpcyBrbm93biBhcyB0aGUgcHJpbWFyeSBrZXksIGFuZCB0aGUgdmFsdWVzIHdpbGwgb25seSBiZSBqb2luZWQgaWYgdGhleSB2YWx1ZXMgaW4gdGhlIHByaW1hcnkga2V5IGNvbHVtbiBhcmUgaWRlbnRpY2FsLiBTbyBvbmUgb2YgdGhlIGRhdGEgd3JhbmdsaW5nIHRhc2tzIGFzc29jaWF0ZWQgd2l0aCBqb2lucyBpcyBtYWtpbmcgc3VyZSB0aGF0IHRoZSB2YWx1ZXMgaW4gdGhlIHByaW1hcnkga2V5IGNvbHVtbiBvZiBib3RoIGRhdGFzZXRzIGFyZSBpbiB0aGUgcmlnaHQgZm9ybWF0Lg0KDQpUaGUgb2J2aW91cyBvcHRpb24gZm9yIGpvaW5pbmcgdGhlIHdlYXRoZXIgZGF0YSB0byB0aGUgcG9pbnQtb2Ytc2FsZSBkYXRhIGlzIHRvIHVzZSB0aGUgVGltZSBhbmQgZGF0ZSBjb2x1bW4uIE5vdGljZSB0aGF0IHdoaWxlIHRoZSBUaW1lIGNvbHVtbiBpbiB0aGUgajE3aSBkYXRhIGFuZCB0aGUgZGF0ZSBjb2x1bW4gaW4gdGhlIGoxN3cgZGF0YSBhcmUgYm90aCB0aGUgc2FtZSBkYXRhIHR5cGUsIGFsbCBvZiB0aGUgZGF0ZSB2YWx1ZXMgYXJlIHJvdW5kZWQgdG8gdGhlIGJlZ2lubmluZyBob3VyLCBtaW51dGUsIGFuZCBzZWNvbmQgb2YgdGhlIGRheS4gVGhlcmUgYXJlIHByb2JhYmx5IGZldyBpZiBhbnkgb2JzZXJ2YXRpb25zIGF0IG1pZG5pZ2h0IGluIG91ciBkYXRhc2V0Lg0KDQpUaGVyZWZvcmUsIGxldCdzIGFnZ3JlZ2F0ZSB0aGUgcG9pbnQtb2Ytc2FsZSBkYXRhIHNvIHRoYXQgdGhlcmUncyBvbmUgb2JzZXJ2YXRpb24gZm9yIGVhY2ggZGF5Lg0KYGBge3J9DQpqMTdpICU8PiUgbXV0YXRlKGRhdGUgPSByb3VuZF9kYXRlKFRpbWUsICdkYXknKSkNCmRhaWx5IDwtIGoxN2kgJT4lDQogIGdyb3VwX2J5KGRhdGUpICU+JQ0KICBzdW1tYXJpc2UoYXZnQ29zdCA9IG1lYW4oQ29zdCwgbmEucm0gPSBUKQ0KICAgICAgICAgICAgLCBtYXhQcmljZSA9IG1heChQcmljZSwgbmEucm0gPSBUKQ0KICAgICAgICAgICAgLCB0cmFuc2FjdGlvblF1YW50aXR5ID0gbl9kaXN0aW5jdChUcmFuc2FjdGlvbk51bWJlcikpICU+JQ0KICB1bmdyb3VwKCkNCmBgYA0KTm90aWNlIHRoYXQgdGhpcyBkYWlseSBkYXRhZnJhbWUgaGFzIDMwIG9ic2VydmF0aW9ucyBhbmQgZm91ciB2YXJpYWJsZXMuIEl0IGRvZXMgbm90IGluY2x1ZGUgZGF0ZXMgZm9yIEphbnVhcnkgMSBhbmQgSmFudWFyeSAyLCBidXQgZG9lcyBpbmNsdWRlIGFuIG9ic2VydmF0aW9uIGZvciBGZWJydWFyeS4gSW4gY29udHJhc3QsIHRoZSB3ZWF0aGVyIGRhdGEgaGFzIGFuIG9ic2VydmF0aW9uIGZvciBlYWNoIG9mIHRoZSAzMSBkYXlzIGluIEphbnVhcnksIGFuZCBubyBvYnNlcnZhdGlvbnMgZm9yIEZlYnJ1YXJ5LiANCg0KTGV0J3MgcGVyZm9ybSBhIGxlZnQgam9pbiB3aXRoIHRoZSBwb2ludC1vZi1zYWxlIGRhdGEgb24gdGhlIGxlZnQgYW5kIHRoZSB3ZWF0aGVyIGRhdGEgb24gdGhlIHJpZ2h0IGFuZCBzZWUgd2hhdCB3ZSBnZXQ6DQpgYGB7cn0NCmRhaWx5TGVmdCA8LSBkYWlseSAlPiUNCiAgbGVmdF9qb2luKGoxN3csIGJ5ID0gJ2RhdGUnKQ0KYGBgDQoNCldlIG5vdyBoYXZlIGEgZGFpbHlMZWZ0IGRhdGFmcmFtZSB3aXRoIDMwIG9ic2VydmF0aW9ucyBhbmQgOCB2YXJpYWJsZXMuIEEgdmlzdWFsIGluc3BlY3Rpb24gY29uZmlybXMgdGhhdCB0aGUgd2VhdGhlciB2YWx1ZXMgYXJlIHRoZXJlIG9ubHkgZm9yIHRoZSBkYXRlcyBpbiB0aGUgajE3aSBkYXRhZnJhbWUsIHdoaWNoIGlzIHRoZSBsZWZ0IG9uZSBpbiB0aGlzIGNhc2UgYmVjYXVzZSB0aGF0J3Mgd2hhdCBnZXRzIGVudGVyZWQgaW50byB0aGUgbGVmdF9qb2luIGZ1bmN0aW9uIGZpcnN0IGJ5IHRoZSBwaXBlIHN5bWJvbC4gQmVjYXVzZSB0aGVyZSB3YXNuJ3Qgd2VhdGhlciBkYXRhIGZvciBGZWJydWFyeSAxLCB0aGVyZSBhcmUgTkEgdmFsdWVzIGZvciB0aGUgd2VhdGhlciBkYXRhIG9uIHRoYXQgZGF0ZS4NCg0KIyMgUmlnaHQgSm9pbiBvbiBBZ2dyZWdhdGVkIERhdGENCk5vdyBsZXQncyBrZWVwIGFsbW9zdCBldmVyeXRoaW5nIHRoZSBzYW1lLCBleGNlcHQgdGhhdCB3ZSdsbCByZXBsYWNlIHRoZSBsZWZ0X2pvaW4gZnVuY3Rpb24gd2l0aCB0aGUgcmlnaHRfam9pbiBmdW5jdGlvbjoNCmBgYHtyfQ0KZGFpbHlSaWdodCA8LSBkYWlseSAlPiUNCiAgcmlnaHRfam9pbihqMTd3LCBieSA9ICdkYXRlJykNCmBgYA0KVGhlIGRhaWx5UmlnaHQgZGF0YWZyYW1lIGhhcyAzMSBvYnNlcnZhdGlvbnMgYmVjYXVzZSB0aGVyZSBpcyB3ZWF0aGVyIGRhdGEgZm9yIGV2ZXJ5IGRheSBpbiBKYW51YXJ5LCBhbmQgbm8gb2JzZXJ2YXRpb25zIGZvciBkYXRhIGluIEZlYnJ1YXJ5LiBOb3RpY2UgdGhhdCB0aGUgcm93cyBmb3IgSmFudWFyeSAxIGFuZCBKYW5hcnkgMiBoYXZlIE5BIHZhbHVlcyBmb3IgdGhlIHBvaW50LW9mLXNhbGUgY29sdW1ucywgYW5kIHRoZXJlIGlzIG5vdCBhIHZhbHVlIGZvciBGZWJydWFyeSAxLg0KDQojIyBJbm5lciBKb2luIG9uIEFnZ3JlZ2F0ZWQgRGF0YQ0KTGV0J3MgY29tcGFyZSB0aGUgbGVmdCBhbmQgcmlnaHQgam9pbiB0byBhbiBpbm5lciBqb2luLg0KYGBge3J9DQpkYWlseUlubmVyIDwtIGRhaWx5ICU+JQ0KICBpbm5lcl9qb2luKGoxN3csIGJ5ID0gJ2RhdGUnKQ0KYGBgDQpUaGUgZGFpbHlJbm5lciBkYXRhZnJhbWUgaGFzIDI5IG9ic2VydmF0aW9ucywgd2hpY2ggY29ycmVzcG9uZCB0byB0aGUgMjkgZGF5cyBpbiBKYW51YXJ5IHRoYXQgYXJlIGluIGJvdGggdGhlIHBvaW50LW9mLXNhbGUgYW5kIHRoZSB3ZWF0aGVyIGRhdGEuDQoNCiMjIEZ1bGwgSm9pbiBvbiBBZ2dyZWdhdGVkIERhdGENCkZpbmFsbHksIGxldCdzIGV2YWx1YXRlIGEgZnVsbCBqb2luOg0KYGBge3J9DQpkYWlseUZ1bGwgPC0gZGFpbHkgJT4lDQogIGZ1bGxfam9pbihqMTd3LCBieSA9ICdkYXRlJykNCmBgYA0KVGhlIGRhaWx5RnVsbCBkYXRhZnJhbWUga2VlcHMgYWxsIG9ic2VydmF0aW9ucyBpbiBib3RoIGRhdGFmcmFtZXMuIE5hdHVyYWxseSwgaXQgaGFzIHRoZSBtb3N0IG9ic2VydmF0aW9ucy4gVGhlIDMyIG9ic2VydmF0aW9ucyBjb3JyZXNwb25kIHRvIHRoZSAzMSBkYXlzIGluIEphbnVhcnkgcGx1cyB0aGUgZmlyc3QgZGF5IG9mIEZlYnJ1YXJ5LiBOb3RpY2UgdGhhdCB0aGVyZSBhcmUgTkEgdmFsdWVzIGluIHRoZSBwb2ludCBvZiBzYWxlIGNvbHVtbnMgZm9yIEphbnVhcnkgMSBhbmQgMiwgYW5kIE5BIHZhbHVlcyBpbiB0aGUgd2VhdGhlciBjb2x1bW5zIGZvciBGZWJydWFyeSAxLg0KDQojIyBDb25jbHVkaW5nIENvbW1lbnRzDQpJbiBjb25jbHVzaW9uLCBqb2lucyBhcmUgcmVhbGx5IGVhc3kgdG8gZG8uIEFzIHlvdSB1c2UgdGhlbSBtb3JlLCB5b3UnbGwgZmluZCBvdGhlciBvcHRpb25zIGFuZCBvdGhlciB0eXBlcyBvZiBqb2lucy4gSSByZWNvbW1lbmQgdGhhdCB5b3UgdGhpbmsgY2FyZWZ1bGx5IGFib3V0IHdoYXQgdGhlIHNoYXBlIG9mIHRoZSBjb21iaW5lZCBkYXRhZnJhbWUgc2hvdWxkIGJlLiBJZiB0aGUgbnVtYmVyIG9mIHJvd3MgZ3Jvd3MgdW5leHBlY3RlZGx5LCB0aGVuIGl0IHByb2JhYmx5IG1lYW5zIHRoYXQgeW91IGhhdmUgZHVwbGljYXRlIHZhbHVlcyBpbiBvbmUgb2YgdGhlIGRhdGFmcmFtZXMuIFlvdSBlbmQgdXAgbGVhcm5pbmcgYSBsb3QgYWJvdXQgeW91ciBkYXRhIGFzIHlvdSBwcmVwYXJlIHRvIHBlcmZvcm0gam9pbnMuDQoNCg==