Sales Exercise

In this exercise, you will perform a simulated auditing task with data on invoices, customer orders, and shipments.

Overview and setting

Imagine you are asked to work on a new audit client that distributes bath bombs and cowbells (my daughter’s favorite toys at the time I created this exercise). Your manager has provided you with 3 related datasets containing invoice, customer order, and shipment information. Your task is to apply R functions to explore the data and identify observations that require further testing.

General housekeeping items

Let’s begin by opening libraries and clearing the environment:

library(tidyverse)
library(lubridate)
library(scales)

rm(list=ls())

Set your working directory:

setwd('C:/YOURWD')

Download and import the sales exercise datasets

On the course website, you will see three files ( in .txt, and .csv formats). Save each of these to your working directory. Also, review the “data sheet” containing a description of each data set. Next, import the datasets using the “base r” imports (e.g., read.csv). We will not use readr imports here (but will elsewhere in the course).

invoices <- read.csv('invoices.csv', stringsAsFactors = FALSE)
customer_orders <- read.table('customer_orders.txt', header = TRUE, stringsAsFactors = FALSE)
shipping_data <- read.csv('shipping_data.csv', stringsAsFactors = FALSE)

Inspect and clean the sales datasets

After reviewing the data sheet, inspect each dataset carefully. Think about your “strategy” for wrangling each dataset. Consider how each dataset needs to be cleaned, formatted, or reshaped.

Clean and process the invoices dataset:

invoices_data_clean <- invoices %>%
  pivot_wider(id_cols = everything(), names_from = item, values_from = amount)

Clean and process the customer orders dataset:

customer_orders_data_clean <- customer_orders %>%
  mutate(customer_order_date = ymd(customer_order_date))

Clean and process the shipping dataset:

shipping_data_clean <- shipping_data %>%
  mutate(across(c('shipping_date', 'delivery_date'), ymd)) %>%
  separate(customer_id_invoice, c('customer_id', 'invoice_num'), sep = '-') %>%
  mutate(invoice_num = as.numeric(str_trim(invoice_num)))

Join and filter the sales datasets

Use join functions to create a list of invoices that did not have an associated shipment (save in a new dataset called ‘non_shipped_invoices’):

non_shipped_invoices <- invoices_data_clean %>%
  anti_join(shipping_data_clean, by = 'invoice_num')

Use join functions to combine all three datasets together as a new dataset called complete_sales_data. Make sure this dataset retains only observations that match across all three datasets:

complete_sales_data <- invoices_data_clean %>%
  inner_join(customer_orders_data_clean, by = 'customer_order') %>%
  inner_join(shipping_data_clean, by = 'invoice_num')

Using the dataset created above, can you isolate the invoices that may have been recorded in the wrong year? Hint: Look at the shipping dates and shipping terms. (save in a new dataset called ‘wrong_period_sales’):

wrong_period_sales <- complete_sales_data %>%
  filter((terms == 'Destination' & year(delivery_date) != 2018) | 
         (terms == 'Shipping Point' & year(shipping_date) != 2018))

Create a variable in complete_sales_data called ‘inv_month’ that represents the month of the invoice date:

complete_sales_data <- complete_sales_data %>%
  mutate(inv_month = month(invoice_date))

Create a summary of the total value of all invoices (‘invoice_amount’) for each month (‘inv_month’) during 2018. Store in ‘sales_per_month’:

sales_per_month <- complete_sales_data %>%
  group_by(inv_month) %>%
  summarize(total_sales = sum(invoice_amount))

Generate visualizations

Use a line chart to visualize trends in monthly sales using ‘sales_per_month’:

ggplot(sales_per_month, aes(x = inv_month, y = total_sales)) +
  geom_line() +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma, breaks = seq(1,12,1))

Use a scatterplot to visualize the relation between shipping weight and invoice amount - try techniques to avoid “overplotting”:

set.seed(42)
ggplot(sample_n(complete_sales_data, 4000), aes(x = shipping_weight, y = invoice_amount)) +
  geom_point(alpha = 0.2, position = 'jitter') +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma)