Your first dataset describes online ticket sales for various events across the country. It’s stored as a Comma-Separated Value (CSV) file called sales.csv

# Import sales.csv: sales
sales <- read.csv("sales.csv", stringsAsFactors = FALSE)

The first step when preparing to clean data is to inspect it.

# View dimensions of sales
dim(sales)

# Inspect first 6 rows of sales
head(sales)

# View column names of sales
names(sales)

Notice how the rows appear to represent individual purchases and the columns contain different pieces of information about each purchase.

Let´s summarise the data.

# Look at structure of sales
str(sales)

# View a summary of sales
summary(sales)

# Load dplyr
library(dplyr)

# Get a glimpse of sales
glimpse(sales)

You may have noticed that the first column of data is just a duplication of the row numbers. Not very useful. We can delete that column.

# Remove the first column of sales: sales2
sales2 <- sales[-1]

Many of the columns have information that’s of no use to us. For example, the first four columns contain internal codes representing particular events. The last fifteen columns also aren’t worth keeping; there are too many missing values to make them worthwhile.

An easy way to get rid of unnecessary columns is to create a vector containing the column indices you want to keep, then subset the data based on that vector using single bracket subsetting.

# Define a vector of column indices: keep
keep <- c(5:(ncol(sales2)-15))

# Subset sales2 using keep: sales3
sales3 <- sales2[keep]

Some of the columns in your data frame include multiple pieces of information that should be in separate columns. In this exercise, you will separate such a column into two: one for date and one for time.

# Load tidyr
library(tidyr)

# Split event_date_time: sales4
sales4 <- separate(sales3, event_date_time,
                   c("event_dt", "event_time"), sep = " ")

# Split sales_ord_create_dttm: sales5
sales5 <- separate(sales4, sales_ord_create_dttm,
                   c("ord_create_dt", "ord_create_time"), sep = " ")

Looks like that second call to separate() threw a warning. Not to worry; warnings aren’t as bad as error messages. It’s not saying that the command didn’t execute; it’s just a heads-up that something unusual happened.

The warning says Too few values at 4 locations. You may be able to guess already what the issue is, but it’s still good to take a look.

The locations (i.e. rows) given in the warning are 2516, 3863, 4082, and 4183. Have a look at the contents of the sales_ord_create_dttm column in those rows.

# Define an issues vector
issues <- c(2516, 3863, 4082, 4183)

# Print values of sales_ord_create_dttm at these indices
print(sales3$sales_ord_create_dttm[issues])

# Print a well-behaved value of sales_ord_create_dttm
print(sales3$sales_ord_create_dttm[2517])

The warning was just because of four missing values. You’ll ignore them for now, but if your analysis depended on complete date/time information, you would probably need to delete those rows.

Some of the columns in your dataset contain dates of different events. Right now, they are stored as character strings. That’s fine if all you want to do is look up the date associated with an event, but if you want to do any comparisons or math with the dates, it’s MUCH easier to store them as Date objects.

Luckily, all of the date columns in this dataset have the substring “dt” in their name, so you can use the str_detect() function of the stringr package to find the date columns. Then you can coerce them to Date objects using a function from the lubridate package.

You’ll use lapply() to apply the appropriate lubridate function to all of the columns that contain dates. Recall the following syntax for lapply() applied to some data frame columns of interest:

lapply(my_data_frame[, cols], function_name)

Also recall that function names in lubridate combine the letters y, m, d, h, m, and s depending on the format of the date/time string being read in.

# Load stringr
library(stringr)

# Find columns of sales5 containing "dt": date_cols
date_cols <- str_detect(names(sales5), "dt")

# Load lubridate
library(lubridate)

# Coerce date columns into Date objects
sales5[, date_cols] <- lapply(sales5[, date_cols], ymd)

There were a few more warnings…

Some of the calls to ymd() caused a failure to parse warning. That’s probably because of more missing data, but again, it’s good to check to be sure.

See if the number of missing values in each column is the same as the number of rows that failed to parse in the previously.

# Find date columns (don't change)
date_cols <- str_detect(names(sales5), "dt")

# Create logical vectors indicating missing values (don't change)
missing <- lapply(sales5[, date_cols], is.na)

# Create a numerical vector that counts missing values: num_missing
num_missing <- sapply(missing, sum)

# Print num_missing
print(num_missing)

It was missing data again.

Sure enough, the number of NAs in each column match the numbers from the warning messages, so missing data is the culprit. How to proceed depends on your desired analysis. If you really need complete sets of date/time information, you might delete the rows or columns containing NAs.

As your last step, you’ll use the tidyr function unite() to combine the venue_city and venue_state columns into one column with the two values separated by a comma and a space. For example, “PORTLAND” “MAINE” should become “PORTLAND, MAINE”.

# Combine the venue_city and venue_state columns
sales6 <- unite(sales5, venue_city_state, venue_city, venue_state, sep = ", ")


# View the head of sales6
head(sales6)
LS0tDQp0aXRsZTogJ0NsZWFuaW5nIGRhdGE6IHRpY2tldCBzYWxlcycNCm91dHB1dDogaHRtbF9ub3RlYm9vaw0KLS0tDQoNCllvdXIgZmlyc3QgZGF0YXNldCBkZXNjcmliZXMgb25saW5lIHRpY2tldCBzYWxlcyBmb3IgdmFyaW91cyBldmVudHMgYWNyb3NzIHRoZSBjb3VudHJ5LiBJdCdzIHN0b3JlZCBhcyBhIENvbW1hLVNlcGFyYXRlZCBWYWx1ZSAoQ1NWKSBmaWxlIGNhbGxlZCBzYWxlcy5jc3YNCmBgYHtyfQ0KIyBJbXBvcnQgc2FsZXMuY3N2OiBzYWxlcw0Kc2FsZXMgPC0gcmVhZC5jc3YoInNhbGVzLmNzdiIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGQUxTRSkNCmBgYA0KVGhlIGZpcnN0IHN0ZXAgd2hlbiBwcmVwYXJpbmcgdG8gY2xlYW4gZGF0YSBpcyB0byBpbnNwZWN0IGl0Lg0KYGBge3J9DQojIFZpZXcgZGltZW5zaW9ucyBvZiBzYWxlcw0KZGltKHNhbGVzKQ0KDQojIEluc3BlY3QgZmlyc3QgNiByb3dzIG9mIHNhbGVzDQpoZWFkKHNhbGVzKQ0KDQojIFZpZXcgY29sdW1uIG5hbWVzIG9mIHNhbGVzDQpuYW1lcyhzYWxlcykNCmBgYA0KTm90aWNlIGhvdyB0aGUgcm93cyBhcHBlYXIgdG8gcmVwcmVzZW50IGluZGl2aWR1YWwgcHVyY2hhc2VzIGFuZCB0aGUgY29sdW1ucyBjb250YWluIGRpZmZlcmVudCBwaWVjZXMgb2YgaW5mb3JtYXRpb24gYWJvdXQgZWFjaCBwdXJjaGFzZS4NCg0KTGV0wrRzIHN1bW1hcmlzZSB0aGUgZGF0YS4NCmBgYHtyfQ0KIyBMb29rIGF0IHN0cnVjdHVyZSBvZiBzYWxlcw0Kc3RyKHNhbGVzKQ0KDQojIFZpZXcgYSBzdW1tYXJ5IG9mIHNhbGVzDQpzdW1tYXJ5KHNhbGVzKQ0KDQojIExvYWQgZHBseXINCmxpYnJhcnkoZHBseXIpDQoNCiMgR2V0IGEgZ2xpbXBzZSBvZiBzYWxlcw0KZ2xpbXBzZShzYWxlcykNCmBgYA0KWW91IG1heSBoYXZlIG5vdGljZWQgdGhhdCB0aGUgZmlyc3QgY29sdW1uIG9mIGRhdGEgaXMganVzdCBhIGR1cGxpY2F0aW9uIG9mIHRoZSByb3cgbnVtYmVycy4gTm90IHZlcnkgdXNlZnVsLiBXZSBjYW4gZGVsZXRlIHRoYXQgY29sdW1uLg0KYGBge3J9DQojIFJlbW92ZSB0aGUgZmlyc3QgY29sdW1uIG9mIHNhbGVzOiBzYWxlczINCnNhbGVzMiA8LSBzYWxlc1stMV0NCmBgYA0KTWFueSBvZiB0aGUgY29sdW1ucyBoYXZlIGluZm9ybWF0aW9uIHRoYXQncyBvZiBubyB1c2UgdG8gdXMuIEZvciBleGFtcGxlLCB0aGUgZmlyc3QgZm91ciBjb2x1bW5zIGNvbnRhaW4gaW50ZXJuYWwgY29kZXMgcmVwcmVzZW50aW5nIHBhcnRpY3VsYXIgZXZlbnRzLiBUaGUgbGFzdCBmaWZ0ZWVuIGNvbHVtbnMgYWxzbyBhcmVuJ3Qgd29ydGgga2VlcGluZzsgdGhlcmUgYXJlIHRvbyBtYW55IG1pc3NpbmcgdmFsdWVzIHRvIG1ha2UgdGhlbSB3b3J0aHdoaWxlLg0KDQpBbiBlYXN5IHdheSB0byBnZXQgcmlkIG9mIHVubmVjZXNzYXJ5IGNvbHVtbnMgaXMgdG8gY3JlYXRlIGEgdmVjdG9yIGNvbnRhaW5pbmcgdGhlIGNvbHVtbiBpbmRpY2VzIHlvdSB3YW50IHRvIGtlZXAsIHRoZW4gc3Vic2V0IHRoZSBkYXRhIGJhc2VkIG9uIHRoYXQgdmVjdG9yIHVzaW5nIHNpbmdsZSBicmFja2V0IHN1YnNldHRpbmcuDQpgYGB7cn0NCiMgRGVmaW5lIGEgdmVjdG9yIG9mIGNvbHVtbiBpbmRpY2VzOiBrZWVwDQprZWVwIDwtIGMoNToobmNvbChzYWxlczIpLTE1KSkNCg0KIyBTdWJzZXQgc2FsZXMyIHVzaW5nIGtlZXA6IHNhbGVzMw0Kc2FsZXMzIDwtIHNhbGVzMltrZWVwXQ0KYGBgDQpTb21lIG9mIHRoZSBjb2x1bW5zIGluIHlvdXIgZGF0YSBmcmFtZSBpbmNsdWRlIG11bHRpcGxlIHBpZWNlcyBvZiBpbmZvcm1hdGlvbiB0aGF0IHNob3VsZCBiZSBpbiBzZXBhcmF0ZSBjb2x1bW5zLiBJbiB0aGlzIGV4ZXJjaXNlLCB5b3Ugd2lsbCBzZXBhcmF0ZSBzdWNoIGEgY29sdW1uIGludG8gdHdvOiBvbmUgZm9yIGRhdGUgYW5kIG9uZSBmb3IgdGltZS4NCmBgYHtyfQ0KIyBMb2FkIHRpZHlyDQpsaWJyYXJ5KHRpZHlyKQ0KDQojIFNwbGl0IGV2ZW50X2RhdGVfdGltZTogc2FsZXM0DQpzYWxlczQgPC0gc2VwYXJhdGUoc2FsZXMzLCBldmVudF9kYXRlX3RpbWUsDQogICAgICAgICAgICAgICAgICAgYygiZXZlbnRfZHQiLCAiZXZlbnRfdGltZSIpLCBzZXAgPSAiICIpDQoNCiMgU3BsaXQgc2FsZXNfb3JkX2NyZWF0ZV9kdHRtOiBzYWxlczUNCnNhbGVzNSA8LSBzZXBhcmF0ZShzYWxlczQsIHNhbGVzX29yZF9jcmVhdGVfZHR0bSwNCiAgICAgICAgICAgICAgICAgICBjKCJvcmRfY3JlYXRlX2R0IiwgIm9yZF9jcmVhdGVfdGltZSIpLCBzZXAgPSAiICIpDQpgYGANCkxvb2tzIGxpa2UgdGhhdCBzZWNvbmQgY2FsbCB0byBzZXBhcmF0ZSgpIHRocmV3IGEgd2FybmluZy4gTm90IHRvIHdvcnJ5OyB3YXJuaW5ncyBhcmVuJ3QgYXMgYmFkIGFzIGVycm9yIG1lc3NhZ2VzLiBJdCdzIG5vdCBzYXlpbmcgdGhhdCB0aGUgY29tbWFuZCBkaWRuJ3QgZXhlY3V0ZTsgaXQncyBqdXN0IGEgaGVhZHMtdXAgdGhhdCBzb21ldGhpbmcgdW51c3VhbCBoYXBwZW5lZC4NCg0KVGhlIHdhcm5pbmcgc2F5cyBUb28gZmV3IHZhbHVlcyBhdCA0IGxvY2F0aW9ucy4gWW91IG1heSBiZSBhYmxlIHRvIGd1ZXNzIGFscmVhZHkgd2hhdCB0aGUgaXNzdWUgaXMsIGJ1dCBpdCdzIHN0aWxsIGdvb2QgdG8gdGFrZSBhIGxvb2suDQoNClRoZSBsb2NhdGlvbnMgKGkuZS4gcm93cykgZ2l2ZW4gaW4gdGhlIHdhcm5pbmcgYXJlIDI1MTYsIDM4NjMsIDQwODIsIGFuZCA0MTgzLiBIYXZlIGEgbG9vayBhdCB0aGUgY29udGVudHMgb2YgdGhlIHNhbGVzX29yZF9jcmVhdGVfZHR0bSBjb2x1bW4gaW4gdGhvc2Ugcm93cy4NCmBgYHtyfQ0KIyBEZWZpbmUgYW4gaXNzdWVzIHZlY3Rvcg0KaXNzdWVzIDwtIGMoMjUxNiwgMzg2MywgNDA4MiwgNDE4MykNCg0KIyBQcmludCB2YWx1ZXMgb2Ygc2FsZXNfb3JkX2NyZWF0ZV9kdHRtIGF0IHRoZXNlIGluZGljZXMNCnByaW50KHNhbGVzMyRzYWxlc19vcmRfY3JlYXRlX2R0dG1baXNzdWVzXSkNCg0KIyBQcmludCBhIHdlbGwtYmVoYXZlZCB2YWx1ZSBvZiBzYWxlc19vcmRfY3JlYXRlX2R0dG0NCnByaW50KHNhbGVzMyRzYWxlc19vcmRfY3JlYXRlX2R0dG1bMjUxN10pDQpgYGANClRoZSB3YXJuaW5nIHdhcyBqdXN0IGJlY2F1c2Ugb2YgZm91ciBtaXNzaW5nIHZhbHVlcy4gWW91J2xsIGlnbm9yZSB0aGVtIGZvciBub3csIGJ1dCBpZiB5b3VyIGFuYWx5c2lzIGRlcGVuZGVkIG9uIGNvbXBsZXRlIGRhdGUvdGltZSBpbmZvcm1hdGlvbiwgeW91IHdvdWxkIHByb2JhYmx5IG5lZWQgdG8gZGVsZXRlIHRob3NlIHJvd3MuDQoNClNvbWUgb2YgdGhlIGNvbHVtbnMgaW4geW91ciBkYXRhc2V0IGNvbnRhaW4gZGF0ZXMgb2YgZGlmZmVyZW50IGV2ZW50cy4gUmlnaHQgbm93LCB0aGV5IGFyZSBzdG9yZWQgYXMgY2hhcmFjdGVyIHN0cmluZ3MuIFRoYXQncyBmaW5lIGlmIGFsbCB5b3Ugd2FudCB0byBkbyBpcyBsb29rIHVwIHRoZSBkYXRlIGFzc29jaWF0ZWQgd2l0aCBhbiBldmVudCwgYnV0IGlmIHlvdSB3YW50IHRvIGRvIGFueSBjb21wYXJpc29ucyBvciBtYXRoIHdpdGggdGhlIGRhdGVzLCBpdCdzIE1VQ0ggZWFzaWVyIHRvIHN0b3JlIHRoZW0gYXMgRGF0ZSBvYmplY3RzLg0KDQpMdWNraWx5LCBhbGwgb2YgdGhlIGRhdGUgY29sdW1ucyBpbiB0aGlzIGRhdGFzZXQgaGF2ZSB0aGUgc3Vic3RyaW5nICJkdCIgaW4gdGhlaXIgbmFtZSwgc28geW91IGNhbiB1c2UgdGhlIHN0cl9kZXRlY3QoKSBmdW5jdGlvbiBvZiB0aGUgc3RyaW5nciBwYWNrYWdlIHRvIGZpbmQgdGhlIGRhdGUgY29sdW1ucy4gVGhlbiB5b3UgY2FuIGNvZXJjZSB0aGVtIHRvIERhdGUgb2JqZWN0cyB1c2luZyBhIGZ1bmN0aW9uIGZyb20gdGhlIGx1YnJpZGF0ZSBwYWNrYWdlLg0KDQpZb3UnbGwgdXNlIGxhcHBseSgpIHRvIGFwcGx5IHRoZSBhcHByb3ByaWF0ZSBsdWJyaWRhdGUgZnVuY3Rpb24gdG8gYWxsIG9mIHRoZSBjb2x1bW5zIHRoYXQgY29udGFpbiBkYXRlcy4gUmVjYWxsIHRoZSBmb2xsb3dpbmcgc3ludGF4IGZvciBsYXBwbHkoKSBhcHBsaWVkIHRvIHNvbWUgZGF0YSBmcmFtZSBjb2x1bW5zIG9mIGludGVyZXN0Og0KDQogICAgbGFwcGx5KG15X2RhdGFfZnJhbWVbLCBjb2xzXSwgZnVuY3Rpb25fbmFtZSkNCiAgICANCkFsc28gcmVjYWxsIHRoYXQgZnVuY3Rpb24gbmFtZXMgaW4gbHVicmlkYXRlIGNvbWJpbmUgdGhlIGxldHRlcnMgeSwgbSwgZCwgaCwgbSwgYW5kIHMgZGVwZW5kaW5nIG9uIHRoZSBmb3JtYXQgb2YgdGhlIGRhdGUvdGltZSBzdHJpbmcgYmVpbmcgcmVhZCBpbi4NCmBgYHtyfQ0KIyBMb2FkIHN0cmluZ3INCmxpYnJhcnkoc3RyaW5ncikNCg0KIyBGaW5kIGNvbHVtbnMgb2Ygc2FsZXM1IGNvbnRhaW5pbmcgImR0IjogZGF0ZV9jb2xzDQpkYXRlX2NvbHMgPC0gc3RyX2RldGVjdChuYW1lcyhzYWxlczUpLCAiZHQiKQ0KDQojIExvYWQgbHVicmlkYXRlDQpsaWJyYXJ5KGx1YnJpZGF0ZSkNCg0KIyBDb2VyY2UgZGF0ZSBjb2x1bW5zIGludG8gRGF0ZSBvYmplY3RzDQpzYWxlczVbLCBkYXRlX2NvbHNdIDwtIGxhcHBseShzYWxlczVbLCBkYXRlX2NvbHNdLCB5bWQpDQpgYGANClRoZXJlIHdlcmUgYSBmZXcgbW9yZSB3YXJuaW5nc+KApiANCg0KU29tZSBvZiB0aGUgY2FsbHMgdG8geW1kKCkgY2F1c2VkIGEgZmFpbHVyZSB0byBwYXJzZSB3YXJuaW5nLiBUaGF0J3MgcHJvYmFibHkgYmVjYXVzZSBvZiBtb3JlIG1pc3NpbmcgZGF0YSwgYnV0IGFnYWluLCBpdCdzIGdvb2QgdG8gY2hlY2sgdG8gYmUgc3VyZS4NCg0KU2VlIGlmIHRoZSBudW1iZXIgb2YgbWlzc2luZyB2YWx1ZXMgaW4gZWFjaCBjb2x1bW4gaXMgdGhlIHNhbWUgYXMgdGhlIG51bWJlciBvZiByb3dzIHRoYXQgZmFpbGVkIHRvIHBhcnNlIGluIHRoZSBwcmV2aW91c2x5Lg0KYGBge3J9DQojIEZpbmQgZGF0ZSBjb2x1bW5zIChkb24ndCBjaGFuZ2UpDQpkYXRlX2NvbHMgPC0gc3RyX2RldGVjdChuYW1lcyhzYWxlczUpLCAiZHQiKQ0KDQojIENyZWF0ZSBsb2dpY2FsIHZlY3RvcnMgaW5kaWNhdGluZyBtaXNzaW5nIHZhbHVlcyAoZG9uJ3QgY2hhbmdlKQ0KbWlzc2luZyA8LSBsYXBwbHkoc2FsZXM1WywgZGF0ZV9jb2xzXSwgaXMubmEpDQoNCiMgQ3JlYXRlIGEgbnVtZXJpY2FsIHZlY3RvciB0aGF0IGNvdW50cyBtaXNzaW5nIHZhbHVlczogbnVtX21pc3NpbmcNCm51bV9taXNzaW5nIDwtIHNhcHBseShtaXNzaW5nLCBzdW0pDQoNCiMgUHJpbnQgbnVtX21pc3NpbmcNCnByaW50KG51bV9taXNzaW5nKQ0KYGBgDQpJdCB3YXMgbWlzc2luZyBkYXRhIGFnYWluLg0KDQpTdXJlIGVub3VnaCwgdGhlIG51bWJlciBvZiBOQXMgaW4gZWFjaCBjb2x1bW4gbWF0Y2ggdGhlIG51bWJlcnMgZnJvbSB0aGUgd2FybmluZyBtZXNzYWdlcywgc28gbWlzc2luZyBkYXRhIGlzIHRoZSBjdWxwcml0LiBIb3cgdG8gcHJvY2VlZCBkZXBlbmRzIG9uIHlvdXIgZGVzaXJlZCBhbmFseXNpcy4gSWYgeW91IHJlYWxseSBuZWVkIGNvbXBsZXRlIHNldHMgb2YgZGF0ZS90aW1lIGluZm9ybWF0aW9uLCB5b3UgbWlnaHQgZGVsZXRlIHRoZSByb3dzIG9yIGNvbHVtbnMgY29udGFpbmluZyBOQXMuDQoNCkFzIHlvdXIgbGFzdCBzdGVwLCB5b3UnbGwgdXNlIHRoZSB0aWR5ciBmdW5jdGlvbiB1bml0ZSgpIHRvIGNvbWJpbmUgdGhlIHZlbnVlX2NpdHkgYW5kIHZlbnVlX3N0YXRlIGNvbHVtbnMgaW50byBvbmUgY29sdW1uIHdpdGggdGhlIHR3byB2YWx1ZXMgc2VwYXJhdGVkIGJ5IGEgY29tbWEgYW5kIGEgc3BhY2UuIEZvciBleGFtcGxlLCAiUE9SVExBTkQiICJNQUlORSIgc2hvdWxkIGJlY29tZSAiUE9SVExBTkQsIE1BSU5FIi4NCmBgYHtyfQ0KIyBDb21iaW5lIHRoZSB2ZW51ZV9jaXR5IGFuZCB2ZW51ZV9zdGF0ZSBjb2x1bW5zDQpzYWxlczYgPC0gdW5pdGUoc2FsZXM1LCB2ZW51ZV9jaXR5X3N0YXRlLCB2ZW51ZV9jaXR5LCB2ZW51ZV9zdGF0ZSwgc2VwID0gIiwgIikNCg0KDQojIFZpZXcgdGhlIGhlYWQgb2Ygc2FsZXM2DQpoZWFkKHNhbGVzNikNCmBgYA0KDQo=