library(tidyverse)
rm(list=ls())Export, Import, Clean, Merge Tutorial
The purpose of this exercise is to illustrate several wrangling concepts covered in class and in DataCamp. We begin by “messing up” the diamonds dataset. Then we will put it all back together. During this exercise, you will see imports, exports, reshapes, joins, mutations, and more!
General housekeeping items
Let’s begin by opening libraries and clearing the environment:
Next, let’s set a working directory. Recall you can do this by selecting Session > Set Working Directory > Choose Directory. Also, your working directory needs to use “/” rather than “\” (alternatively, you can use “\\”). Make sure to add this to your script.
setwd('YOURWD')
Deconstruct the diamonds dataset
Open and store the diamonds dataset after applying a filter on carat weight (less than 1.5 carats) and creating a unique identifier for each diamond. Information about this dataset can be found here. Store in a new dataset - diamond_small:
diamond_small <- diamonds %>%
filter(carat < 1.5) %>%
mutate(uniq_id = row_number())Select the unique identifier and prices, rename the unique identifier, and add extra characters to price. Create a new dataset - diamond_prices:
diamond_prices <- diamond_small %>%
select(id = uniq_id, price) %>%
mutate(price = str_c('amount ', price, ' USD')) Add a few duplicated observations using a random draw from diamond_prices:
set.seed(42)
diamond_prices <- diamond_prices %>%
bind_rows(slice_sample(diamond_prices, n = 55))Notice the “set.seed()” in the code chunk above. This will make the random draw “reproducible.” If you remove this, or change the seed number, you will get a different draw.
Select diamond characteristics excluding price, change all variables to character type, and reshape to long format. Store as a new dataset - diamond_char_long:
diamond_char_long <- diamond_small %>%
select(-price) %>%
mutate(across(everything() , as.character)) %>%
pivot_longer(cols = !uniq_id, names_to = 'variable', values_to = 'measure')Save each of the new datasets in different formats:
diamond_prices %>%
write_csv('diamond_prices.csv')
diamond_char_long %>%
write_delim('diamond_char_long.txt', delim = '\t')Reconstruct the diamonds dataset
Let’s go ahead and clear objects from the environment again:
rm(list=ls())Take a look at your working directory. You should have two files saved from above. Together the files include all of the information from the original dataset, but the data is in raw and messy form. Let’s begin by importing each dataset:
diamond_prices_import <- read_csv('diamond_prices.csv')
diamond_char_import <- read_tsv('diamond_char_long.txt')Take a look at the diamond_prices data.
diamond_prices_import# A tibble: 47,760 × 2
id price
<dbl> <chr>
1 1 amount 326 USD
2 2 amount 326 USD
3 3 amount 327 USD
4 4 amount 334 USD
5 5 amount 335 USD
6 6 amount 336 USD
7 7 amount 336 USD
8 8 amount 337 USD
9 9 amount 337 USD
10 10 amount 338 USD
# ℹ 47,750 more rows
As you can see, the price variable needs to be cleaned. Let’s tackle it!
diamond_prices <- diamond_prices_import %>%
mutate(price = str_replace(price, 'amount ', '')) %>%
separate(price, into = c('price', 'currency'), sep = ' ') %>%
mutate(price = as.numeric(price))Recall that we added some duplicates to the diamond prices dataset… let’s check for duplicates and remove them:
diamond_prices %>%
group_by(id) %>%
filter(n() > 1)# A tibble: 110 × 3
# Groups: id [55]
id price currency
<dbl> <dbl> <chr>
1 16 345 USD
2 103 2760 USD
3 149 2768 USD
4 1907 3080 USD
5 5261 3797 USD
6 5897 3936 USD
7 7453 4235 USD
8 7700 580 USD
9 8826 4480 USD
10 9091 4535 USD
# ℹ 100 more rows
diamond_prices <- diamond_prices %>%
distinct(id, .keep_all = TRUE)Take a look at the diamond_char_import dataset. It clearly needs to be reshaped. Let’s do that first.
diamond_char_import# A tibble: 429,345 × 3
uniq_id variable measure
<dbl> <chr> <chr>
1 1 carat 0.23
2 1 cut Ideal
3 1 color E
4 1 clarity SI2
5 1 depth 61.5
6 1 table 55
7 1 x 3.95
8 1 y 3.98
9 1 z 2.43
10 2 carat 0.21
# ℹ 429,335 more rows
The dataset clearly needs to be reshaped. Let’s do that first.
diamond_char_wide <- diamond_char_import %>%
pivot_wider(id_cols = uniq_id, names_from = variable, values_from = measure)Take a look at the reshaped dataset, diamond_char_wide:
diamond_char_wide# A tibble: 47,705 × 10
uniq_id carat cut color clarity depth table x y z
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 0.23 Ideal E SI2 61.5 55 3.95 3.98 2.43
2 2 0.21 Premium E SI1 59.8 61 3.89 3.84 2.31
3 3 0.23 Good E VS1 56.9 65 4.05 4.07 2.31
4 4 0.29 Premium I VS2 62.4 58 4.2 4.23 2.63
5 5 0.31 Good J SI2 63.3 58 4.34 4.35 2.75
6 6 0.24 Very Good J VVS2 62.8 57 3.94 3.96 2.48
7 7 0.24 Very Good I VVS1 62.3 57 3.95 3.98 2.47
8 8 0.26 Very Good H SI1 61.9 55 4.07 4.11 2.53
9 9 0.22 Fair E VS2 65.1 61 3.87 3.78 2.49
10 10 0.23 Very Good H VS1 59.4 61 4 4.05 2.39
# ℹ 47,695 more rows
That’s better. However, all formatting from the variables is lost. We need to convert some variables to numeric and others to ordered factors:
diamond_char_wide <- diamond_char_wide %>%
mutate(across(c('carat', 'depth', 'table', 'x', 'y', 'z'), as.numeric)) %>%
mutate(clarity = factor(clarity, levels = c('I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF'), ordered = TRUE)) %>%
mutate(color = factor(color, levels = c('D','E','F','G','H','I','J'), ordered = TRUE)) %>%
mutate(cut = factor(cut, levels = c('Fair','Good','Very Good','Premium','Ideal'), ordered = TRUE)) As a sanity check, let’s inspect the dataset for duplicate observations:
diamond_char_wide %>%
group_by(uniq_id) %>%
filter(n() > 1)# A tibble: 0 × 10
# Groups: uniq_id [0]
# ℹ 10 variables: uniq_id <dbl>, carat <dbl>, cut <ord>, color <ord>,
# clarity <ord>, depth <dbl>, table <dbl>, x <dbl>, y <dbl>, z <dbl>
OK… now we are ready to join! Take a look at each dataset and see if you can find a common variable to link the datasets together.
diamonds_recon <- diamond_prices %>%
inner_join(diamond_char_wide, by = c('id' = 'uniq_id'))Compare the reconstructed dataset to the original dataset to check whether some observations do not match the original dataset:
diamonds_recon %>%
anti_join(diamonds)Joining with `by = join_by(price, carat, cut, color, clarity, depth, table, x,
y, z)`
# A tibble: 0 × 12
# ℹ 12 variables: id <dbl>, price <dbl>, currency <chr>, carat <dbl>,
# cut <ord>, color <ord>, clarity <ord>, depth <dbl>, table <dbl>, x <dbl>,
# y <dbl>, z <dbl>
Let’s plot our reconstructed dataset:
diamonds_recon %>%
ggplot(aes(x = carat, y = price, color = clarity)) +
geom_point(alpha = 0.2)