Your first dataset describes online ticket sales for various events across the country. It’s stored as a Comma-Separated Value (CSV) file called sales.csv
sales <- read.csv("sales.csv", stringsAsFactors = FALSE)
The first step when preparing to clean data is to inspect it.
dim(sales)
head(sales)
names(sales)
Notice how the rows appear to represent individual purchases and the columns contain different pieces of information about each purchase.
Let´s summarise the data.
str(sales)
summary(sales)
library(dplyr)
glimpse(sales)
You may have noticed that the first column of data is just a duplication of the row numbers. Not very useful. We can delete that column.
Many of the columns have information that’s of no use to us. For example, the first four columns contain internal codes representing particular events. The last fifteen columns also aren’t worth keeping; there are too many missing values to make them worthwhile.
An easy way to get rid of unnecessary columns is to create a vector containing the column indices you want to keep, then subset the data based on that vector using single bracket subsetting.
keep <- c(5:(ncol(sales2)-15))
sales3 <- sales2[keep]
Some of the columns in your data frame include multiple pieces of information that should be in separate columns. In this exercise, you will separate such a column into two: one for date and one for time.
library(tidyr)
sales4 <- separate(sales3, event_date_time,
c("event_dt", "event_time"), sep = " ")
sales5 <- separate(sales4, sales_ord_create_dttm,
c("ord_create_dt", "ord_create_time"), sep = " ")
Looks like that second call to separate() threw a warning. Not to worry; warnings aren’t as bad as error messages. It’s not saying that the command didn’t execute; it’s just a heads-up that something unusual happened.
The warning says Too few values at 4 locations. You may be able to guess already what the issue is, but it’s still good to take a look.
The locations (i.e. rows) given in the warning are 2516, 3863, 4082, and 4183. Have a look at the contents of the sales_ord_create_dttm column in those rows.
issues <- c(2516, 3863, 4082, 4183)
print(sales3$sales_ord_create_dttm[issues])
print(sales3$sales_ord_create_dttm[2517])
The warning was just because of four missing values. You’ll ignore them for now, but if your analysis depended on complete date/time information, you would probably need to delete those rows.
Some of the columns in your dataset contain dates of different events. Right now, they are stored as character strings. That’s fine if all you want to do is look up the date associated with an event, but if you want to do any comparisons or math with the dates, it’s MUCH easier to store them as Date objects.
Luckily, all of the date columns in this dataset have the substring “dt” in their name, so you can use the str_detect() function of the stringr package to find the date columns. Then you can coerce them to Date objects using a function from the lubridate package.
You’ll use lapply() to apply the appropriate lubridate function to all of the columns that contain dates. Recall the following syntax for lapply() applied to some data frame columns of interest:
lapply(my_data_frame[, cols], function_name)
Also recall that function names in lubridate combine the letters y, m, d, h, m, and s depending on the format of the date/time string being read in.
library(stringr)
date_cols <- str_detect(names(sales5), "dt")
library(lubridate)
sales5[, date_cols] <- lapply(sales5[, date_cols], ymd)
There were a few more warnings…
Some of the calls to ymd() caused a failure to parse warning. That’s probably because of more missing data, but again, it’s good to check to be sure.
See if the number of missing values in each column is the same as the number of rows that failed to parse in the previously.
date_cols <- str_detect(names(sales5), "dt")
missing <- lapply(sales5[, date_cols], is.na)
num_missing <- sapply(missing, sum)
print(num_missing)
It was missing data again.
Sure enough, the number of NAs in each column match the numbers from the warning messages, so missing data is the culprit. How to proceed depends on your desired analysis. If you really need complete sets of date/time information, you might delete the rows or columns containing NAs.
As your last step, you’ll use the tidyr function unite() to combine the venue_city and venue_state columns into one column with the two values separated by a comma and a space. For example, “PORTLAND” “MAINE” should become “PORTLAND, MAINE”.
sales6 <- unite(sales5, venue_city_state, venue_city, venue_state, sep = ", ")
head(sales6)
LS0tDQp0aXRsZTogJ0NsZWFuaW5nIGRhdGE6IHRpY2tldCBzYWxlcycNCm91dHB1dDogaHRtbF9ub3RlYm9vaw0KLS0tDQoNCllvdXIgZmlyc3QgZGF0YXNldCBkZXNjcmliZXMgb25saW5lIHRpY2tldCBzYWxlcyBmb3IgdmFyaW91cyBldmVudHMgYWNyb3NzIHRoZSBjb3VudHJ5LiBJdCdzIHN0b3JlZCBhcyBhIENvbW1hLVNlcGFyYXRlZCBWYWx1ZSAoQ1NWKSBmaWxlIGNhbGxlZCBzYWxlcy5jc3YNCmBgYHtyfQ0KIyBJbXBvcnQgc2FsZXMuY3N2OiBzYWxlcw0Kc2FsZXMgPC0gcmVhZC5jc3YoInNhbGVzLmNzdiIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGQUxTRSkNCmBgYA0KVGhlIGZpcnN0IHN0ZXAgd2hlbiBwcmVwYXJpbmcgdG8gY2xlYW4gZGF0YSBpcyB0byBpbnNwZWN0IGl0Lg0KYGBge3J9DQojIFZpZXcgZGltZW5zaW9ucyBvZiBzYWxlcw0KZGltKHNhbGVzKQ0KDQojIEluc3BlY3QgZmlyc3QgNiByb3dzIG9mIHNhbGVzDQpoZWFkKHNhbGVzKQ0KDQojIFZpZXcgY29sdW1uIG5hbWVzIG9mIHNhbGVzDQpuYW1lcyhzYWxlcykNCmBgYA0KTm90aWNlIGhvdyB0aGUgcm93cyBhcHBlYXIgdG8gcmVwcmVzZW50IGluZGl2aWR1YWwgcHVyY2hhc2VzIGFuZCB0aGUgY29sdW1ucyBjb250YWluIGRpZmZlcmVudCBwaWVjZXMgb2YgaW5mb3JtYXRpb24gYWJvdXQgZWFjaCBwdXJjaGFzZS4NCg0KTGV0wrRzIHN1bW1hcmlzZSB0aGUgZGF0YS4NCmBgYHtyfQ0KIyBMb29rIGF0IHN0cnVjdHVyZSBvZiBzYWxlcw0Kc3RyKHNhbGVzKQ0KDQojIFZpZXcgYSBzdW1tYXJ5IG9mIHNhbGVzDQpzdW1tYXJ5KHNhbGVzKQ0KDQojIExvYWQgZHBseXINCmxpYnJhcnkoZHBseXIpDQoNCiMgR2V0IGEgZ2xpbXBzZSBvZiBzYWxlcw0KZ2xpbXBzZShzYWxlcykNCmBgYA0KWW91IG1heSBoYXZlIG5vdGljZWQgdGhhdCB0aGUgZmlyc3QgY29sdW1uIG9mIGRhdGEgaXMganVzdCBhIGR1cGxpY2F0aW9uIG9mIHRoZSByb3cgbnVtYmVycy4gTm90IHZlcnkgdXNlZnVsLiBXZSBjYW4gZGVsZXRlIHRoYXQgY29sdW1uLg0KYGBge3J9DQojIFJlbW92ZSB0aGUgZmlyc3QgY29sdW1uIG9mIHNhbGVzOiBzYWxlczINCnNhbGVzMiA8LSBzYWxlc1stMV0NCmBgYA0KTWFueSBvZiB0aGUgY29sdW1ucyBoYXZlIGluZm9ybWF0aW9uIHRoYXQncyBvZiBubyB1c2UgdG8gdXMuIEZvciBleGFtcGxlLCB0aGUgZmlyc3QgZm91ciBjb2x1bW5zIGNvbnRhaW4gaW50ZXJuYWwgY29kZXMgcmVwcmVzZW50aW5nIHBhcnRpY3VsYXIgZXZlbnRzLiBUaGUgbGFzdCBmaWZ0ZWVuIGNvbHVtbnMgYWxzbyBhcmVuJ3Qgd29ydGgga2VlcGluZzsgdGhlcmUgYXJlIHRvbyBtYW55IG1pc3NpbmcgdmFsdWVzIHRvIG1ha2UgdGhlbSB3b3J0aHdoaWxlLg0KDQpBbiBlYXN5IHdheSB0byBnZXQgcmlkIG9mIHVubmVjZXNzYXJ5IGNvbHVtbnMgaXMgdG8gY3JlYXRlIGEgdmVjdG9yIGNvbnRhaW5pbmcgdGhlIGNvbHVtbiBpbmRpY2VzIHlvdSB3YW50IHRvIGtlZXAsIHRoZW4gc3Vic2V0IHRoZSBkYXRhIGJhc2VkIG9uIHRoYXQgdmVjdG9yIHVzaW5nIHNpbmdsZSBicmFja2V0IHN1YnNldHRpbmcuDQpgYGB7cn0NCiMgRGVmaW5lIGEgdmVjdG9yIG9mIGNvbHVtbiBpbmRpY2VzOiBrZWVwDQprZWVwIDwtIGMoNToobmNvbChzYWxlczIpLTE1KSkNCg0KIyBTdWJzZXQgc2FsZXMyIHVzaW5nIGtlZXA6IHNhbGVzMw0Kc2FsZXMzIDwtIHNhbGVzMltrZWVwXQ0KYGBgDQpTb21lIG9mIHRoZSBjb2x1bW5zIGluIHlvdXIgZGF0YSBmcmFtZSBpbmNsdWRlIG11bHRpcGxlIHBpZWNlcyBvZiBpbmZvcm1hdGlvbiB0aGF0IHNob3VsZCBiZSBpbiBzZXBhcmF0ZSBjb2x1bW5zLiBJbiB0aGlzIGV4ZXJjaXNlLCB5b3Ugd2lsbCBzZXBhcmF0ZSBzdWNoIGEgY29sdW1uIGludG8gdHdvOiBvbmUgZm9yIGRhdGUgYW5kIG9uZSBmb3IgdGltZS4NCmBgYHtyfQ0KIyBMb2FkIHRpZHlyDQpsaWJyYXJ5KHRpZHlyKQ0KDQojIFNwbGl0IGV2ZW50X2RhdGVfdGltZTogc2FsZXM0DQpzYWxlczQgPC0gc2VwYXJhdGUoc2FsZXMzLCBldmVudF9kYXRlX3RpbWUsDQogICAgICAgICAgICAgICAgICAgYygiZXZlbnRfZHQiLCAiZXZlbnRfdGltZSIpLCBzZXAgPSAiICIpDQoNCiMgU3BsaXQgc2FsZXNfb3JkX2NyZWF0ZV9kdHRtOiBzYWxlczUNCnNhbGVzNSA8LSBzZXBhcmF0ZShzYWxlczQsIHNhbGVzX29yZF9jcmVhdGVfZHR0bSwNCiAgICAgICAgICAgICAgICAgICBjKCJvcmRfY3JlYXRlX2R0IiwgIm9yZF9jcmVhdGVfdGltZSIpLCBzZXAgPSAiICIpDQpgYGANCkxvb2tzIGxpa2UgdGhhdCBzZWNvbmQgY2FsbCB0byBzZXBhcmF0ZSgpIHRocmV3IGEgd2FybmluZy4gTm90IHRvIHdvcnJ5OyB3YXJuaW5ncyBhcmVuJ3QgYXMgYmFkIGFzIGVycm9yIG1lc3NhZ2VzLiBJdCdzIG5vdCBzYXlpbmcgdGhhdCB0aGUgY29tbWFuZCBkaWRuJ3QgZXhlY3V0ZTsgaXQncyBqdXN0IGEgaGVhZHMtdXAgdGhhdCBzb21ldGhpbmcgdW51c3VhbCBoYXBwZW5lZC4NCg0KVGhlIHdhcm5pbmcgc2F5cyBUb28gZmV3IHZhbHVlcyBhdCA0IGxvY2F0aW9ucy4gWW91IG1heSBiZSBhYmxlIHRvIGd1ZXNzIGFscmVhZHkgd2hhdCB0aGUgaXNzdWUgaXMsIGJ1dCBpdCdzIHN0aWxsIGdvb2QgdG8gdGFrZSBhIGxvb2suDQoNClRoZSBsb2NhdGlvbnMgKGkuZS4gcm93cykgZ2l2ZW4gaW4gdGhlIHdhcm5pbmcgYXJlIDI1MTYsIDM4NjMsIDQwODIsIGFuZCA0MTgzLiBIYXZlIGEgbG9vayBhdCB0aGUgY29udGVudHMgb2YgdGhlIHNhbGVzX29yZF9jcmVhdGVfZHR0bSBjb2x1bW4gaW4gdGhvc2Ugcm93cy4NCmBgYHtyfQ0KIyBEZWZpbmUgYW4gaXNzdWVzIHZlY3Rvcg0KaXNzdWVzIDwtIGMoMjUxNiwgMzg2MywgNDA4MiwgNDE4MykNCg0KIyBQcmludCB2YWx1ZXMgb2Ygc2FsZXNfb3JkX2NyZWF0ZV9kdHRtIGF0IHRoZXNlIGluZGljZXMNCnByaW50KHNhbGVzMyRzYWxlc19vcmRfY3JlYXRlX2R0dG1baXNzdWVzXSkNCg0KIyBQcmludCBhIHdlbGwtYmVoYXZlZCB2YWx1ZSBvZiBzYWxlc19vcmRfY3JlYXRlX2R0dG0NCnByaW50KHNhbGVzMyRzYWxlc19vcmRfY3JlYXRlX2R0dG1bMjUxN10pDQpgYGANClRoZSB3YXJuaW5nIHdhcyBqdXN0IGJlY2F1c2Ugb2YgZm91ciBtaXNzaW5nIHZhbHVlcy4gWW91J2xsIGlnbm9yZSB0aGVtIGZvciBub3csIGJ1dCBpZiB5b3VyIGFuYWx5c2lzIGRlcGVuZGVkIG9uIGNvbXBsZXRlIGRhdGUvdGltZSBpbmZvcm1hdGlvbiwgeW91IHdvdWxkIHByb2JhYmx5IG5lZWQgdG8gZGVsZXRlIHRob3NlIHJvd3MuDQoNClNvbWUgb2YgdGhlIGNvbHVtbnMgaW4geW91ciBkYXRhc2V0IGNvbnRhaW4gZGF0ZXMgb2YgZGlmZmVyZW50IGV2ZW50cy4gUmlnaHQgbm93LCB0aGV5IGFyZSBzdG9yZWQgYXMgY2hhcmFjdGVyIHN0cmluZ3MuIFRoYXQncyBmaW5lIGlmIGFsbCB5b3Ugd2FudCB0byBkbyBpcyBsb29rIHVwIHRoZSBkYXRlIGFzc29jaWF0ZWQgd2l0aCBhbiBldmVudCwgYnV0IGlmIHlvdSB3YW50IHRvIGRvIGFueSBjb21wYXJpc29ucyBvciBtYXRoIHdpdGggdGhlIGRhdGVzLCBpdCdzIE1VQ0ggZWFzaWVyIHRvIHN0b3JlIHRoZW0gYXMgRGF0ZSBvYmplY3RzLg0KDQpMdWNraWx5LCBhbGwgb2YgdGhlIGRhdGUgY29sdW1ucyBpbiB0aGlzIGRhdGFzZXQgaGF2ZSB0aGUgc3Vic3RyaW5nICJkdCIgaW4gdGhlaXIgbmFtZSwgc28geW91IGNhbiB1c2UgdGhlIHN0cl9kZXRlY3QoKSBmdW5jdGlvbiBvZiB0aGUgc3RyaW5nciBwYWNrYWdlIHRvIGZpbmQgdGhlIGRhdGUgY29sdW1ucy4gVGhlbiB5b3UgY2FuIGNvZXJjZSB0aGVtIHRvIERhdGUgb2JqZWN0cyB1c2luZyBhIGZ1bmN0aW9uIGZyb20gdGhlIGx1YnJpZGF0ZSBwYWNrYWdlLg0KDQpZb3UnbGwgdXNlIGxhcHBseSgpIHRvIGFwcGx5IHRoZSBhcHByb3ByaWF0ZSBsdWJyaWRhdGUgZnVuY3Rpb24gdG8gYWxsIG9mIHRoZSBjb2x1bW5zIHRoYXQgY29udGFpbiBkYXRlcy4gUmVjYWxsIHRoZSBmb2xsb3dpbmcgc3ludGF4IGZvciBsYXBwbHkoKSBhcHBsaWVkIHRvIHNvbWUgZGF0YSBmcmFtZSBjb2x1bW5zIG9mIGludGVyZXN0Og0KDQogICAgbGFwcGx5KG15X2RhdGFfZnJhbWVbLCBjb2xzXSwgZnVuY3Rpb25fbmFtZSkNCiAgICANCkFsc28gcmVjYWxsIHRoYXQgZnVuY3Rpb24gbmFtZXMgaW4gbHVicmlkYXRlIGNvbWJpbmUgdGhlIGxldHRlcnMgeSwgbSwgZCwgaCwgbSwgYW5kIHMgZGVwZW5kaW5nIG9uIHRoZSBmb3JtYXQgb2YgdGhlIGRhdGUvdGltZSBzdHJpbmcgYmVpbmcgcmVhZCBpbi4NCmBgYHtyfQ0KIyBMb2FkIHN0cmluZ3INCmxpYnJhcnkoc3RyaW5ncikNCg0KIyBGaW5kIGNvbHVtbnMgb2Ygc2FsZXM1IGNvbnRhaW5pbmcgImR0IjogZGF0ZV9jb2xzDQpkYXRlX2NvbHMgPC0gc3RyX2RldGVjdChuYW1lcyhzYWxlczUpLCAiZHQiKQ0KDQojIExvYWQgbHVicmlkYXRlDQpsaWJyYXJ5KGx1YnJpZGF0ZSkNCg0KIyBDb2VyY2UgZGF0ZSBjb2x1bW5zIGludG8gRGF0ZSBvYmplY3RzDQpzYWxlczVbLCBkYXRlX2NvbHNdIDwtIGxhcHBseShzYWxlczVbLCBkYXRlX2NvbHNdLCB5bWQpDQpgYGANClRoZXJlIHdlcmUgYSBmZXcgbW9yZSB3YXJuaW5nc+KApiANCg0KU29tZSBvZiB0aGUgY2FsbHMgdG8geW1kKCkgY2F1c2VkIGEgZmFpbHVyZSB0byBwYXJzZSB3YXJuaW5nLiBUaGF0J3MgcHJvYmFibHkgYmVjYXVzZSBvZiBtb3JlIG1pc3NpbmcgZGF0YSwgYnV0IGFnYWluLCBpdCdzIGdvb2QgdG8gY2hlY2sgdG8gYmUgc3VyZS4NCg0KU2VlIGlmIHRoZSBudW1iZXIgb2YgbWlzc2luZyB2YWx1ZXMgaW4gZWFjaCBjb2x1bW4gaXMgdGhlIHNhbWUgYXMgdGhlIG51bWJlciBvZiByb3dzIHRoYXQgZmFpbGVkIHRvIHBhcnNlIGluIHRoZSBwcmV2aW91c2x5Lg0KYGBge3J9DQojIEZpbmQgZGF0ZSBjb2x1bW5zIChkb24ndCBjaGFuZ2UpDQpkYXRlX2NvbHMgPC0gc3RyX2RldGVjdChuYW1lcyhzYWxlczUpLCAiZHQiKQ0KDQojIENyZWF0ZSBsb2dpY2FsIHZlY3RvcnMgaW5kaWNhdGluZyBtaXNzaW5nIHZhbHVlcyAoZG9uJ3QgY2hhbmdlKQ0KbWlzc2luZyA8LSBsYXBwbHkoc2FsZXM1WywgZGF0ZV9jb2xzXSwgaXMubmEpDQoNCiMgQ3JlYXRlIGEgbnVtZXJpY2FsIHZlY3RvciB0aGF0IGNvdW50cyBtaXNzaW5nIHZhbHVlczogbnVtX21pc3NpbmcNCm51bV9taXNzaW5nIDwtIHNhcHBseShtaXNzaW5nLCBzdW0pDQoNCiMgUHJpbnQgbnVtX21pc3NpbmcNCnByaW50KG51bV9taXNzaW5nKQ0KYGBgDQpJdCB3YXMgbWlzc2luZyBkYXRhIGFnYWluLg0KDQpTdXJlIGVub3VnaCwgdGhlIG51bWJlciBvZiBOQXMgaW4gZWFjaCBjb2x1bW4gbWF0Y2ggdGhlIG51bWJlcnMgZnJvbSB0aGUgd2FybmluZyBtZXNzYWdlcywgc28gbWlzc2luZyBkYXRhIGlzIHRoZSBjdWxwcml0LiBIb3cgdG8gcHJvY2VlZCBkZXBlbmRzIG9uIHlvdXIgZGVzaXJlZCBhbmFseXNpcy4gSWYgeW91IHJlYWxseSBuZWVkIGNvbXBsZXRlIHNldHMgb2YgZGF0ZS90aW1lIGluZm9ybWF0aW9uLCB5b3UgbWlnaHQgZGVsZXRlIHRoZSByb3dzIG9yIGNvbHVtbnMgY29udGFpbmluZyBOQXMuDQoNCkFzIHlvdXIgbGFzdCBzdGVwLCB5b3UnbGwgdXNlIHRoZSB0aWR5ciBmdW5jdGlvbiB1bml0ZSgpIHRvIGNvbWJpbmUgdGhlIHZlbnVlX2NpdHkgYW5kIHZlbnVlX3N0YXRlIGNvbHVtbnMgaW50byBvbmUgY29sdW1uIHdpdGggdGhlIHR3byB2YWx1ZXMgc2VwYXJhdGVkIGJ5IGEgY29tbWEgYW5kIGEgc3BhY2UuIEZvciBleGFtcGxlLCAiUE9SVExBTkQiICJNQUlORSIgc2hvdWxkIGJlY29tZSAiUE9SVExBTkQsIE1BSU5FIi4NCmBgYHtyfQ0KIyBDb21iaW5lIHRoZSB2ZW51ZV9jaXR5IGFuZCB2ZW51ZV9zdGF0ZSBjb2x1bW5zDQpzYWxlczYgPC0gdW5pdGUoc2FsZXM1LCB2ZW51ZV9jaXR5X3N0YXRlLCB2ZW51ZV9jaXR5LCB2ZW51ZV9zdGF0ZSwgc2VwID0gIiwgIikNCg0KDQojIFZpZXcgdGhlIGhlYWQgb2Ygc2FsZXM2DQpoZWFkKHNhbGVzNikNCmBgYA0KDQo=