HRRM Case Study - Data Analysis Workflow

library(tidyverse)
library(skimr)
library(janitor)
library(scales)
library(ggrepel)
library(RColorBrewer)
library(hrbrthemes)
library(gt)

options(scipen = 999)
knitr::opts_chunk$set(error = T)

Read in raw data from csv files using read.csv function.

base_customer_raw <- read.csv("BaseCustomer.csv", header = TRUE)
customer_opportunity_raw <- read.csv("CustomerOpportunity.csv", header = TRUE)
product_mapping_raw <- read.csv("ProductMapping.csv", header = TRUE)

Data cleaning and tidying

For this initial step I wanted to make sure that my data frames had uniform column names with conventional titles. I used the janitor package to adjust capitalization/punctuation differences and used the coalesce function to combine column data in the product_mapping data frame. The base_customer data frame had 14 rows which included NAs for almost all columns. These observations were removed from the data frame. Lastly, I used the compare_df_cols_same function to ensure that the data frames did not lose any observations during this process.

base_customer <- janitor::clean_names(base_customer_raw) |> 
  drop_na(customer) 
base_customer$customer_industry_desc <- tolower(base_customer$customer_industry_desc)
compare_df_cols_same(base_customer, base_customer_raw)

## [1] TRUE

customer_opportunity <- janitor::clean_names(customer_opportunity_raw) |> 
  rename(product_category_id = product_category4_id, 
         product_category = product_category2_text, 
         subcategory = product_category3_text) |> 
  mutate(product_category = na_if(product_category, "#N/A"), 
         subcategory = na_if(subcategory, "#N/A"))

compare_df_cols_same(customer_opportunity, customer_opportunity_raw)

## [1] TRUE

product_mapping <- janitor::clean_names(product_mapping_raw, parsing_option = 3) |> 
  mutate(cat1 = coalesce(category1, category1_2, ca_tegory1), 
                  cat2 = coalesce(category2, categ_ory2), 
                  cat3 = coalesce(categor_y3, category3, category3_2, category3_3, category3_4)) |> 
  rename(product_category_id = product_category4_id, 
         cat4 = category4) |> 
  select(product_category_id, cat1, cat2, cat3, cat4)
compare_df_cols_same(product_mapping, product_mapping_raw)

## [1] TRUE

Next, I would like to combine my data sets into one data frame which will allow me to analyze trends and look for unique observations between all variables. Then, I would like to transform my product_category and subcategory columns to contain numeric values only. As a final step of organizing my data, I would also like to reorder my data in order to make it easier to read, digest, and analyze.

new_customer_opportunity <- customer_opportunity |> 
  left_join(product_mapping, by = 'product_category_id')
total_data <- new_customer_opportunity |> 
  merge(base_customer) |> 
  select(-c(currency, pursuit_start_date:date_won_loss_abandoned, customer_text, customer_sic_text, probability_prc, weighted_sales)) |>
  relocate(snapshot_fy, snapshot_fiscal_year_period, snapshot_fp, everything()) 
total_data <- total_data |> 
  mutate(product_category = parse_number(total_data$product_category), 
         subcategory = parse_number(total_data$subcategory))

Exploratory Analysis

First, I would like to look at both overall number of sales, total sales in dollars, and average dollars per sale for each product category and product subcategory.

cat_sales <- total_data |> 
  filter(!is.na(product_category)) |> 
  group_by(product_category, subcategory) |> 
  summarize(n_sales = n(), 
            tot_sales = sum(total_sales), 
            avg_sales = round(tot_sales/n(), 2)) |> 
  arrange(desc(n_sales))

Next, I wanted to create a subset of the data in order to identify areas of growth from fiscal year 2019 to fiscal year 2020. Then, I want to calculate each fiscal year total sales and average sales per fiscal period. From here we can identify the direction of particular trends in hopes of learning more about the direction of growth. This code allows us to determine which product categories had the highest growth from fiscal year 2019 to fiscal year 2020, both in total sales and average sales per fiscal period.

total_2019_data <- total_data |> 
  filter(snapshot_fy == 2019) |> 
  group_by(product_category) |> 
  summarize(n_sales_2019 = n(), 
            tot_sales_2019 = sum(total_sales), 
            avg_per_fp_2019 = round(tot_sales_2019/n(), 2)) |> 
  arrange(desc(n_sales_2019))
total_2020_data<- total_data |> 
  filter(snapshot_fy == 2020) |> 
  group_by(product_category) |> 
  summarize(n_sales_2020 = n(), 
            tot_sales_2020 = sum(total_sales), 
            avg_per_fp_2020= round(tot_sales_2020/n(), 2)) |> 
  arrange(desc(n_sales_2020))
total_growth_data <- total_2019_data |> 
  merge(total_2020_data) |> 
  mutate(growth_n_sales = n_sales_2020 - n_sales_2019, 
         growth_tot_sales = tot_sales_2020 - tot_sales_2019, 
         growth_tot_sales_pct = round((growth_tot_sales/tot_sales_2019) *100, 2), 
         growth_avg_sales = avg_per_fp_2020 - avg_per_fp_2019, 
         growth_avg_sales_pct = round((growth_avg_sales/avg_per_fp_2019) *100, 2), 
         overall_tot_sales = tot_sales_2019 + tot_sales_2020) |> 
  arrange(desc(growth_tot_sales_pct), overall_tot_sales)

After noticing the extreme positive growth in product category 8, I wanted to identify the largest increase of percentage growth in total sales. Here I examined the growth from each fiscal year period to the next and found that the fiscal period at the end of 2019 leading in to the beginning of 2020 saw the largest increase in total sales percentage growth (96.6%).

growth <- total_data |> 
  group_by(snapshot_fp) |> 
  filter(product_category == 8)|> 
  summarize(tot_sales = sum(total_sales)) |> 
  mutate(fp = snapshot_fp - lag(snapshot_fp), 
         diff_growth = tot_sales - lag(tot_sales), 
         pct_growth = (diff_growth)/lag(tot_sales)* 100)

Next, I wanted to look at the sales performance of category provided in the product mapping data set.

category <- total_data |> 
  mutate(category = case_when(cat1 == 1 ~ 1, 
                              cat2 == 1 ~ 2, 
                              cat3 == 1 ~ 3, 
                              cat4 == 1 ~ 4))

new_category <- category |> 
  filter(!is.na(category), !is.na(product_category)) |> 
  group_by(category, customer_industry_desc) |> 
  summarize(n = n(), 
            tot_sales = sum(total_sales), 
            avg_per_fp = tot_sales/n) |> 
  mutate(customer_industry_desc = parse_number(customer_industry_desc))

Visualizing the Data

Here is the code used to identify the highest percentage growth in total sales from 2019 to 2020.

total_growth_data |> 
  slice_max(growth_tot_sales_pct, n = 10) |> 
  select(product_category, tot_sales_2019, tot_sales_2020, growth_tot_sales_pct) |> 
    mutate(tot_sales_2019 = comma(tot_sales_2019), 
           tot_sales_2020 = comma(tot_sales_2020), 
           growth_tot_sales_pct = comma(growth_tot_sales_pct)) |> 
  gt() |> 
  fmt_currency(columns = c(tot_sales_2019, tot_sales_2020)) |> 
  tab_header(title = md("Largest **Growth** in Total Sales"), 
             subtitle = "From Fiscal Year 2019 to Fiscal Year 2020") |> 
  cols_label(product_category = 'Product Category', 
             tot_sales_2019 = "2019 Total Sales", 
             tot_sales_2020 = "2020 Total Sales", 
             growth_tot_sales_pct = "Percent Growth") |> 
data_color(columns = product_category,
           palette = "gray97") |> 
data_color(columns = tot_sales_2019, 
           fn = col_numeric(palette = "Greens", 
                            domain = c(40000000, 5000000000))) |> 
data_color(columns = tot_sales_2020, 
           fn = col_numeric(palette = "Greens", 
                            domain = c(123000000, 104000000000))) |> 
data_color(columns = growth_tot_sales_pct, 
           fn = col_numeric(palette = "Greens", 
                                    domain = c(75, 2040)))

## Error in UseMethod("rescale"): no applicable method for 'rescale' applied to an object of class "character"

Now we will look at the product category with the largest decline in total sales as a percentage between fiscal year 2019 and fiscal year 2020.

total_growth_data |> 
  filter(!is.na(product_category)) |> 
  slice_min(growth_tot_sales, n = 10) |> 
  mutate(growth_sales = comma(tot_sales_2020 - tot_sales_2019), 
         growth_tot_sales_pct = round((growth_tot_sales/tot_sales_2019) *100, 2), 
         tot_sales_2019 = comma(tot_sales_2019), 
           tot_sales_2020 = comma(tot_sales_2020)) |> 
  select(product_category, tot_sales_2019, tot_sales_2020, growth_sales, growth_tot_sales_pct) |> arrange(growth_tot_sales_pct) |> 
  gt() |> 
  fmt_currency(columns = c(tot_sales_2019,tot_sales_2020, growth_sales)) |> 
  tab_header(title = md("Largest **Decline** in Total Sales"), 
             subtitle = "From Fiscal Year 2019 to Fiscal Year 2020") |> 
  cols_label(product_category = 'Product Category', 
             tot_sales_2019 = "2019 Total Sales", 
             tot_sales_2020 = "2020 Total Sales", 
             growth_sales = "Difference in Income",
             growth_tot_sales_pct = "Percentage Change in Total Sales") |> 
data_color(columns = product_category,
           palette = "gray97") |> 
data_color(columns = tot_sales_2019, 
           fn = col_numeric(palette = "Greens", 
                            domain = c(70000000, 96000000000))) |> 
data_color(columns = tot_sales_2020, 
           fn = col_numeric(palette = "Greens", 
                            domain = c(9900000, 91000000000))) |> 
data_color(columns = growth_tot_sales_pct, 
           fn = col_numeric(palette = "Greens", 
                                    domain = c(-4, -99), 
                            reverse = TRUE))

## Error in UseMethod("rescale"): no applicable method for 'rescale' applied to an object of class "character"

Visualization that identifies which categories had the highest average sales per fiscal period, as observed by customer industry.

new_category |> 
ggplot(aes(customer_industry_desc, avg_per_fp, fill = factor(category))) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Average Sales per Fiscal Period", 
       subtitle = "by Customer Industry", 
       x = "Customer Industry", 
       y = "Average Sales") +
  guides(fill = guide_legend(title = "Product Mapping Category")) +
  scale_x_discrete(limits = factor(1:7)) +
  scale_y_continuous(labels = label_dollar()) +
  theme_ipsum()

I then wanted to look at each customer industry in order to assess which ones had positive growth from 2019 to 2020.

drop19 <- total_data |> 
  filter(snapshot_fy == 2019) |> 
  group_by(customer_industry_desc) |> 
  summarize(tot_2019 = comma(sum(total_sales)))
drop20 <- total_data |> 
  filter(snapshot_fy == 2020) |> 
  group_by(customer_industry_desc) |> 
  summarize(tot_2020 = comma(sum(total_sales)))
drop <- full_join(drop19, drop20) |> 
  mutate(total_drop = comma(tot_2020 - tot_2019), 
         pct_growth = comma((total_drop/tot_2019) *100))

## Joining with `by = join_by(customer_industry_desc)`

## Error in `mutate()`:
## ℹ In argument: `total_drop = comma(tot_2020 - tot_2019)`.
## Caused by error in `tot_2020 - tot_2019`:
## ! non-numeric argument to binary operator

ggplot(drop, aes(tot_2019, tot_2020)) +
  geom_point() +
  geom_text_repel(aes(label = customer_industry_desc)) +
  stat_smooth(geom = "line", method = "lm", se = F, alpha = .4, color = 'darkgreen') +
  labs(title = "Total Sales for Each Fiscal Year",
       subtitle = "Separated by Industry",
       x = "Total Sales (2019)", 
       y = "Total Sales (2020)") +
  theme(plot.title = element_text(hjust = .5), 
        plot.subtitle = element_text(hjust = .5)) +
  scale_y_continuous(label = label_dollar()) +
  scale_x_continuous(label = label_dollar()) +
  theme_classic()

## Error in `ggplot()`:
## ! `data` cannot be a function.
## ℹ Have you misspelled the `data` argument in `ggplot()`

It is very clear from this plot that Industry 4 are our customers who have the highest sales totals in each fiscal period. But we will next remove this outlier and take a closer look at the remaining 6 categories, hopefully identifying the one that we can grow the most.

drop |> 
  filter(customer_industry_desc != 'industry4') |> 
  ggplot(aes(tot_2019, tot_2020)) +
  geom_point() +
  geom_text_repel(aes(label = customer_industry_desc)) +
  stat_smooth(geom = "line", method = "lm", se = F, alpha = .4, color = 'darkgreen') +
  labs(title = "Total Sales for Each Fiscal Year",
       subtitle = "Separated by Industry",
       x = "Total Sales (2019)", 
       y = "Total Sales (2020)") +
  theme(plot.title = element_text(hjust = .5), 
        plot.subtitle = element_text(hjust = .5)) +
  scale_y_continuous(label = label_dollar()) +
  scale_x_continuous(label = label_dollar()) +
  theme_classic()

## Error in UseMethod("filter"): no applicable method for 'filter' applied to an object of class "function"

We can see that Industry 6 has our lowest sales totals for each fiscal period. Now we will look and see which direction each industry is trending in.

drop |> 
  arrange(desc(pct_growth)) |> 
  gt() |> 
    tab_header(title = md("Customer Industry Growth"), 
             subtitle = "From Fiscal Year 2019 to Fiscal Year 2020") |> 
  cols_label(customer_industry_desc = 'Customer Industry', 
             tot_2019 = "2019 Total Sales", 
             tot_2020 = "2020 Total Sales", 
             total_drop = "Total Growth",
             pct_growth= "Percentage Change in Total Sales") |> 
data_color(columns = customer_industry_desc,
           palette = "gray97") |> 
data_color(columns = pct_growth, 
           fn = col_numeric(palette = "RdYlGn", 
                                    domain = c(-4, 21), 
                            reverse = FALSE)) |> 
  fmt_currency(columns = c(tot_2019, tot_2020, total_drop))

## Error in UseMethod("arrange"): no applicable method for 'arrange' applied to an object of class "function"

From here we can see some very interesting patterns emerging. While Industry Category 4 has the highest total sales for both fiscal periods, it is growing at a very slow rate. This may be inconsequential since this industry category leads all sales by a vast amount. However, Industry Category 2 is growing the most. On the other hand, Industry Category 5 is trending in the wrong direction. This warrants further examination and attention in order to reverse the course of this trend.

Then, I wanted to see which product categories were having the least success in terms of total sales. For this, we will look at the product category and the product category ID in order to provide as much detail as possible.

pc <- total_data |> 
  filter(!is.na(product_category)) |> 
  group_by(customer_industry_desc, product_category) |> 
  summarize(n = n(), 
            total_sales = sum(total_sales))

## `summarise()` has grouped output by 'customer_industry_desc'. You can override
## using the `.groups` argument.

pc_max <- pc |> 
  slice_max(n, n = 5)
pc_min <- pc |> 
  slice_min(n, n = 1, with_ties = F) 
pc_min |> 
  gt() |> 
  fmt_currency(columns = total_sales) |> 
  tab_header(title = md("Lowest Selling Product Categories"), 
             subtitle = "Per Customer Industry") |> 
  cols_label(total_sales = "Total Sales", 
             n = "Number of Sales Occurrences", 
             product_category = "Product Category")

Product Category	Number of Sales Occurrences	Total Sales
Lowest Selling Product Categories
Per Customer Industry
industry1
10	1	$5,000.00
industry2
13	1	$20,000.00
industry3
5	1	$1,100,000.00
industry4
13	1	$9,000.00
industry5
1	1	$95,000.00
industry6
10	2	$10,000.00
industry7
29	1	$100,000.00

Next, I wanted to look at the Strategic Opportunity categories to see where we could focus our attention in order to reach maximum sales. I grouped the data set by the Strategic Opportunity variable and used a linear regression model to discover that the greatest relationship between total sales in dollars and number of sales took place in the Strategic Opportunity category.

strat <- total_data |> 
  filter(strategic_opportunity != "") |> 
  group_by(strategic_opportunity) |> 
  summarize(tot_sales = sum(total_sales), 
            n = n())
mod <- lm(tot_sales ~ n, strat)
summary(mod)

## 
## Call:
## lm(formula = tot_sales ~ n, data = strat)
## 
## Residuals:
##         Min          1Q      Median          3Q         Max 
## -3234562625  -371369149  -134815976   -98205679 10046511441 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  90991344  227871720   0.399               0.691    
## n             3278262     224768  14.585 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1853000000 on 73 degrees of freedom
## Multiple R-squared:  0.7445, Adjusted R-squared:  0.741 
## F-statistic: 212.7 on 1 and 73 DF,  p-value: < 0.00000000000000022

After grouping all categories given in our data set, none came close to the Adjusted R-Squared value of .74. This tells us that 74% of the growth in total sales comes from the number of sales (n) when using Strategic Opportunity as a decider. Therefore, it is critical that we focus on improving the number of sales in the Strategic Opportunity areas which are lacking overall occurrences.

strat |> 
  slice_min(n, n = 10) |> 
  gt() |> 
  fmt_currency(columns = tot_sales) |> 
  tab_header(title = md("Strategic Opportunity Areas"), 
             subtitle = "Lowest Number of Total Sales") |> 
  cols_label(strategic_opportunity = 'Strategic Opportunity Area', 
             tot_sales = "Total Sales", 
             n = "Number of Sales Occurrences") |> 
  data_color(columns = tot_sales, 
           fn = col_numeric(palette = "Greens", 
                                    domain = c(150000, 20000000), 
                            reverse = FALSE))

Strategic Opportunity Area	Total Sales	Number of Sales Occurrences
Strategic Opportunity Areas
Lowest Number of Total Sales
VENTURES: Future of Forensic - dTrax	$1,000,000.00	1
VENTURES: Mod Com - Audio Surveillance Solutions	$500,000.00	1
VENTURES: Mod Com- CIAA	$150,000.00	1
VENTURES: Federal Expansion -Non-Priority Geos	$8,642,972.00	2
VENTURES:DCG: DevSecOps	$20,000,000.00	2
VENTURES: Digital Risk - Dgtl Finance - Dgtl Controllership	$900,000.00	3
VENTURES: Disruptive Scaled Services -Risk Sensing	$1,319,163.00	3
VENTURES: S&Q \| FoF - Blockchain Initiatives	$240,000.00	3
VENTURES: DSS - Cyber Security Risk Management	$7,200,000.00	4
VENTURES: Mod Com - Digital Testing Approach	$3,200,000.00	4
VENTURES:DCG: Cyber Ecosystem	$240,002.00	4
VENTURES:DCG: Data Destruction Recovery	$1,452,830.00	4

HRRM Case Study - Data Analysis Workflow

Patrick Milum

July 7, 2023

Data cleaning and tidying

Exploratory Analysis

Visualizing the Data