Export, Import, Clean, Merge Tutorial

The purpose of this exercise is to illustrate several wrangling concepts covered in class and in DataCamp. We begin by “messing up” the diamonds dataset. Then we will put it all back together. During this exercise, you will see imports, exports, reshapes, joins, mutations, and more!

General housekeeping items

Let’s begin by loading the tidyverse and clearing the environment:

library(tidyverse)
rm(list=ls())

Next, let’s set a working directory. Recall you can do this by selecting Session > Set Working Directory > Choose Directory. Note that the path passed to setwd() must use “/” rather than “\” (alternatively, escape each backslash as “\\”). Make sure to add this line to your script.

setwd('YOURWD')
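For example, a Windows path that appears as C:\Users\me\data in File Explorer would be written either of the following ways (the path here is just a placeholder; substitute your own):

setwd('C:/Users/me/data')      # forward slashes
setwd('C:\\Users\\me\\data')   # escaped backslashes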

Deconstruct the diamonds dataset

Open and store the diamonds dataset after applying a filter on carat weight (less than 1.5 carats) and creating a unique identifier for each diamond. Information about this dataset can be found in the ggplot2 documentation (run ?diamonds). Store the result in a new dataset - diamond_small:

diamond_small <- diamonds %>%
  filter(carat < 1.5) %>%
  mutate(uniq_id = row_number())

Select the unique identifier and prices, rename the unique identifier, and add extra characters to price. Create a new dataset - diamond_prices:

diamond_prices <- diamond_small %>%
  select(id = uniq_id, price) %>%
  mutate(price = str_c('amount ', price, ' USD')) 

Add a few duplicated observations using a random draw from diamond_prices:

set.seed(42)
diamond_prices <- diamond_prices %>%
  bind_rows(slice_sample(diamond_prices, n = 55))

Setting seed

Notice the “set.seed()” in the code chunk above. This will make the random draw “reproducible.” If you remove this, or change the seed number, you will get a different draw.
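If you want to see this for yourself, here is a small experiment (the seed values and sample size are arbitrary):

set.seed(123)
slice_sample(diamond_prices, n = 3)  # re-running these two lines returns the same 3 rows

set.seed(321)
slice_sample(diamond_prices, n = 3)  # a different seed returns a different draw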

Select diamond characteristics excluding price, change all variables to character type, and reshape to long format. Store as a new dataset - diamond_char_long:

diamond_char_long <- diamond_small %>%
  select(-price) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(cols = !uniq_id, names_to = 'variable', values_to = 'measure')

Save each of the new datasets in a different format - a CSV file and a tab-delimited text file:

diamond_prices %>%
  write_csv('diamond_prices.csv')

diamond_char_long %>%
  write_delim('diamond_char_long.txt', delim = '\t')
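If you want to confirm that both files landed in your working directory, here is a quick check (this assumes you kept the file names used above):

list.files(pattern = 'diamond')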

Reconstruct the diamonds dataset

Let’s go ahead and clear objects from the environment again:

rm(list=ls())

Take a look at your working directory. You should have two files saved from above. Together the files contain all of the information from the filtered dataset we created, but in raw and messy form. Let’s begin by importing each dataset:

diamond_prices_import <- read_csv('diamond_prices.csv')
diamond_char_import <- read_tsv('diamond_char_long.txt')

Take a look at the diamond_prices_import data:

diamond_prices_import
# A tibble: 47,760 × 2
      id price         
   <dbl> <chr>         
 1     1 amount 326 USD
 2     2 amount 326 USD
 3     3 amount 327 USD
 4     4 amount 334 USD
 5     5 amount 335 USD
 6     6 amount 336 USD
 7     7 amount 336 USD
 8     8 amount 337 USD
 9     9 amount 337 USD
10    10 amount 338 USD
# ℹ 47,750 more rows

As you can see, the price variable needs to be cleaned. Let’s tackle it!

diamond_prices <- diamond_prices_import %>%
  mutate(price = str_replace(price, 'amount ', '')) %>% 
  separate(price, into = c('price', 'currency'), sep = ' ') %>% 
  mutate(price = as.numeric(price))
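As an aside, readr::parse_number() offers a more compact way to pull out just the numeric portion of a string. A sketch of that alternative, assuming you do not need to keep the currency label (the name diamond_prices_alt is only for illustration):

diamond_prices_alt <- diamond_prices_import %>%
  mutate(price = parse_number(price))  # strips 'amount ' and ' USD', keeps the number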

Recall that we added some duplicates to the diamond prices dataset… let’s check for duplicates and remove them:

diamond_prices %>%
  group_by(id) %>%
  filter(n() > 1)
# A tibble: 110 × 3
# Groups:   id [55]
      id price currency
   <dbl> <dbl> <chr>   
 1    16   345 USD     
 2   103  2760 USD     
 3   149  2768 USD     
 4  1907  3080 USD     
 5  5261  3797 USD     
 6  5897  3936 USD     
 7  7453  4235 USD     
 8  7700   580 USD     
 9  8826  4480 USD     
10  9091  4535 USD     
# ℹ 100 more rows
diamond_prices <- diamond_prices %>%
  distinct(id, .keep_all = TRUE)
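To confirm the duplicates are gone, one option is to count rows per id; this should now return zero rows:

diamond_prices %>%
  count(id) %>%
  filter(n > 1)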

Take a look at the diamond_char_import dataset:

diamond_char_import
# A tibble: 429,345 × 3
   uniq_id variable measure
     <dbl> <chr>    <chr>  
 1       1 carat    0.23   
 2       1 cut      Ideal  
 3       1 color    E      
 4       1 clarity  SI2    
 5       1 depth    61.5   
 6       1 table    55     
 7       1 x        3.95   
 8       1 y        3.98   
 9       1 z        2.43   
10       2 carat    0.21   
# ℹ 429,335 more rows

The dataset clearly needs to be reshaped to wide format. Let’s do that first:

diamond_char_wide <- diamond_char_import %>% 
  pivot_wider(id_cols = uniq_id, names_from = variable, values_from = measure)

Take a look at the reshaped dataset, diamond_char_wide:

diamond_char_wide
# A tibble: 47,705 × 10
   uniq_id carat cut       color clarity depth table x     y     z    
     <dbl> <chr> <chr>     <chr> <chr>   <chr> <chr> <chr> <chr> <chr>
 1       1 0.23  Ideal     E     SI2     61.5  55    3.95  3.98  2.43 
 2       2 0.21  Premium   E     SI1     59.8  61    3.89  3.84  2.31 
 3       3 0.23  Good      E     VS1     56.9  65    4.05  4.07  2.31 
 4       4 0.29  Premium   I     VS2     62.4  58    4.2   4.23  2.63 
 5       5 0.31  Good      J     SI2     63.3  58    4.34  4.35  2.75 
 6       6 0.24  Very Good J     VVS2    62.8  57    3.94  3.96  2.48 
 7       7 0.24  Very Good I     VVS1    62.3  57    3.95  3.98  2.47 
 8       8 0.26  Very Good H     SI1     61.9  55    4.07  4.11  2.53 
 9       9 0.22  Fair      E     VS2     65.1  61    3.87  3.78  2.49 
10      10 0.23  Very Good H     VS1     59.4  61    4     4.05  2.39 
# ℹ 47,695 more rows

That’s better. However, the original variable types were lost along the way; every column is now stored as character. We need to convert some variables to numeric and others to ordered factors:

diamond_char_wide <- diamond_char_wide %>%
  mutate(across(c('carat', 'depth', 'table', 'x', 'y', 'z'), as.numeric)) %>%
  mutate(clarity = factor(clarity, levels = c('I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF'), ordered = TRUE)) %>%
  mutate(color = factor(color, levels = c('D','E','F','G','H','I','J'), ordered = TRUE)) %>%
  mutate(cut = factor(cut, levels = c('Fair','Good','Very Good','Premium','Ideal'), ordered = TRUE)) 
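To verify the conversions worked, you can inspect the column types and the ordering of one of the factors:

glimpse(diamond_char_wide)     # carat, depth, table, x, y, z should now be <dbl>
levels(diamond_char_wide$cut)  # 'Fair' through 'Ideal', in increasing order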

As a sanity check, let’s inspect the dataset for duplicate observations:

diamond_char_wide %>%
  group_by(uniq_id) %>%
  filter(n() > 1)
# A tibble: 0 × 10
# Groups:   uniq_id [0]
# ℹ 10 variables: uniq_id <dbl>, carat <dbl>, cut <ord>, color <ord>,
#   clarity <ord>, depth <dbl>, table <dbl>, x <dbl>, y <dbl>, z <dbl>

OK… now we are ready to join! Take a look at each dataset and see if you can find a common variable to link the datasets together.

diamonds_recon <- diamond_prices %>%
  inner_join(diamond_char_wide, by = c('id' = 'uniq_id'))
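One way to confirm the join behaved as expected is to compare row counts. With an inner join on a one-to-one key, all three counts should match (47,705 rows here):

nrow(diamond_prices)
nrow(diamond_char_wide)
nrow(diamonds_recon)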

Compare the reconstructed dataset to the original to check whether any observations fail to match. An anti_join that returns zero rows means every reconstructed row has a matching row in diamonds:

diamonds_recon %>%
  anti_join(diamonds)
Joining with `by = join_by(price, carat, cut, color, clarity, depth, table, x,
y, z)`
# A tibble: 0 × 12
# ℹ 12 variables: id <dbl>, price <dbl>, currency <chr>, carat <dbl>,
#   cut <ord>, color <ord>, clarity <ord>, depth <dbl>, table <dbl>, x <dbl>,
#   y <dbl>, z <dbl>
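As one last check, the number of reconstructed rows should match the number of diamonds under 1.5 carats in the original data:

nrow(diamonds_recon) == nrow(filter(diamonds, carat < 1.5))  # should be TRUE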

Let’s plot our reconstructed dataset:

diamonds_recon %>%
  ggplot(aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.2)
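If you want to explore further, one option is to facet the same plot by cut (a sketch):

diamonds_recon %>%
  ggplot(aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~ cut)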