Discovering Associations Between Products

Blackwell Electronics’ board of directors is considering acquiring Electronidex, a start-up electronics online retailer. The board of directors has asked us to help them better understand the clientele that Electronidex currently is serving and if it would be an optimal partnership.

They need our help to identify purchasing patterns that will provide insight into Electronidex’s clientele. The CSV file contains one month’s (30 days’ worth) of Electronidexes online transactions and a file containing all the electronics that they currently sell. Due to their lack of funding, Electronidex is only able to pull data on the items that customers purchased per their transactions.

tr <- read.transactions('C:/Users/andre/OneDrive/Andres Marquez/UBIQUM/Project 3/Part 3/ElectronidexTransactions2017.csv', format = 'basket', sep=',')

## Warning in readLines(file, encoding = encoding): incomplete final line
## found on 'C:/Users/andre/OneDrive/Andres Marquez/UBIQUM/Project 3/Part 3/
## ElectronidexTransactions2017.csv'

## Warning in asMethod(object): removing duplicated items in transactions

tr

## transactions in sparse format with
##  9835 transactions (rows) and
##  125 items (columns)

summary(tr)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  125 columns (items) and a density of 0.03506172 
## 
## most frequent items:
##                     iMac                HP Laptop CYBERPOWER Gamer Desktop 
##                     2519                     1909                     1809 
##            Apple Earpods        Apple MacBook Air                  (Other) 
##                     1715                     1530                    33622 
## 
## element (itemset/transaction) length distribution:
## sizes
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##    2 2163 1647 1294 1021  856  646  540  439  353  247  171  119   77   72   56 
##   16   17   18   19   20   21   22   23   25   26   27   29   30 
##   41   26   20   10   10   10    5    3    1    1    3    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   4.383   6.000  30.000 
## 
## includes extended item information - examples:
##                             labels
## 1 1TB Portable External Hard Drive
## 2 2TB Portable External Hard Drive
## 3                   3-Button Mouse

We can see 9.835 transactions made. We have 125 items.

Density: The percentage of non-empty cells in the sparse matrix. In another words, the total number of items that are purchased divided by the total number of possible items in that matrix. We can calculate how many items were purchased using density like so: 43.104 items.

The most frequent items are the iMac, HP Laptop, CYBERPOWER Gamer Desktop, Apple Earpods and the Apple McBook Air.

Looking at the size of the transactions: 2163 transactions were for just 1 item, 1647 transactions for 2 items, all the way up to the biggest transaction: 1 transaction for 30 items. This indicates that most customers buy a small number of items in each transaction. The distribution of the data is right skewed.

We can also see that we have two rows with no transactions at all.

itemFrequencyPlot(tr, topN=20, type='absolute', col="#003366")

The iMac is the top selling product in Electrodinex, followed by the HP laptop, as told before.

For future purposes, we’ll be creating a dataframe with all the products with the frequency they were bought.

df <- as.data.frame(itemFrequency(tr, type='absolute'))
df <- cbind(Product = rownames(df), df)
rownames(df) <- 1:nrow(df)
colnames(df) <- c('Products','Count')

Let’s create some rules, we’ll be first doing it by confidence, which is the probability that a rule is correct with items on the left for a new transaction. Confidence tells us what percentage of transactions with item A also have item B. (e.g., how many transactions that have bread also have butter).

rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [125 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [635 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules <- sort(rules, by='confidence', decreasing = TRUE)
summary(rules)

## set of 635 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5   6 
##  30 303 269  33 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    4.00    4.00    4.48    5.00    6.00 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.8000   Min.   :0.001017   Min.   : 3.123  
##  1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.:0.001220   1st Qu.: 3.304  
##  Median :0.001118   Median :0.8462   Median :0.001322   Median : 3.579  
##  Mean   :0.001268   Mean   :0.8661   Mean   :0.001475   Mean   : 4.042  
##  3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.:0.001627   3rd Qu.: 4.416  
##  Max.   :0.005287   Max.   :1.0000   Max.   :0.006507   Max.   :10.069  
##      count      
##  Min.   :10.00  
##  1st Qu.:10.00  
##  Median :11.00  
##  Mean   :12.47  
##  3rd Qu.:13.00  
##  Max.   :52.00  
## 
## mining info:
##  data ntransactions support confidence
##    tr          9835   0.001        0.8
##                                                            call
##  apriori(data = tr, parameter = list(supp = 0.001, conf = 0.8))

rules <- rules[!is.redundant(rules)]

inspect(rules[1:10])

##      lhs                                         rhs             support confidence    coverage     lift count
## [1]  {Brother Printer,                                                                                        
##       Halter Acrylic Monitor Stand}           => {iMac}      0.001118454          1 0.001118454 3.904327    11
## [2]  {ASUS Monitor,                                                                                           
##       Mackie CR Speakers,                                                                                     
##       ViewSonic Monitor}                      => {iMac}      0.001016777          1 0.001016777 3.904327    10
## [3]  {Apple Magic Keyboard,                                                                                   
##       Rii LED Gaming Keyboard & Mouse Combo,                                                                  
##       ViewSonic Monitor}                      => {iMac}      0.001728521          1 0.001728521 3.904327    17
## [4]  {ASUS Monitor,                                                                                           
##       Koss Home Headphones,                                                                                   
##       Microsoft Office Home and Student 2016} => {iMac}      0.001016777          1 0.001016777 3.904327    10
## [5]  {Acer Aspire,                                                                                            
##       Koss Home Headphones,                                                                                   
##       ViewSonic Monitor}                      => {HP Laptop} 0.001220132          1 0.001220132 5.151912    12
## [6]  {Dell Desktop,                                                                                           
##       Koss Home Headphones,                                                                                   
##       ViewSonic Monitor}                      => {HP Laptop} 0.001118454          1 0.001118454 5.151912    11
## [7]  {ASUS 2 Monitor,                                                                                         
##       Dell Desktop,                                                                                           
##       Logitech Keyboard}                      => {iMac}      0.001016777          1 0.001016777 3.904327    10
## [8]  {Alienware Laptop,                                                                                       
##       ASUS Desktop,                                                                                           
##       Lenovo Desktop Computer}                => {iMac}      0.001016777          1 0.001016777 3.904327    10
## [9]  {Brother Printer,                                                                                        
##       Dell Desktop,                                                                                           
##       Epson Printer}                          => {iMac}      0.001118454          1 0.001118454 3.904327    11
## [10] {Apple Magic Keyboard,                                                                                   
##       Brother Printer,                                                                                        
##       ViewSonic Monitor}                      => {iMac}      0.001016777          1 0.001016777 3.904327    10

topRules <- rules[1:10]
plot(topRules, col = rainbow(4, start = 0.35, end = 0.8))

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(topRules, method="graph", control=list("items"),col = rainbow(4, start = 0.35, end = 0.8))

## Warning: Unknown control parameters:

## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

plot(topRules, method = "grouped", col = rainbow(4, start = 0.35, end = 0.8))

Our results clearly show how the iMac is the top selling product and why. There are a lot of different packages of products bought that include the iMac, which make us believe that Electrodinex has a B2B business and also a B2C.

Now, we’ll be doing the same, but by lift, which is the ratio that exceeds the expected confidence by the confidence of the rule. The ratio of the number of respondents obtained with the model to the number obtained without the model is known as lift:

Lift (A => B) = 1 means that within a set of elements there is no correlation. Lift (A = > B) > 1 means a positive correlation is more commonly purchased between the products in the product set, i.e. in items A and B. Lift(A =>B) <1 means that it is unlikely to be purchased together for the negative correlation of the itemset, i.e. the products in the item set, A, and B.

rulesL <- apriori(tr, parameter = list(supp=0.001, conf=0.8))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [125 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [635 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rulesL <- sort(rules, by='lift', decreasing = TRUE)
summary(rulesL)

## set of 591 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5   6 
##  30 302 239  20 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.421   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.8000   Min.   :0.001017   Min.   : 3.123  
##  1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.:0.001220   1st Qu.: 3.304  
##  Median :0.001220   Median :0.8462   Median :0.001322   Median : 3.579  
##  Mean   :0.001276   Mean   :0.8654   Mean   :0.001486   Mean   : 4.024  
##  3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.:0.001627   3rd Qu.: 4.416  
##  Max.   :0.005287   Max.   :1.0000   Max.   :0.006507   Max.   :10.069  
##      count      
##  Min.   :10.00  
##  1st Qu.:10.00  
##  Median :12.00  
##  Mean   :12.55  
##  3rd Qu.:13.00  
##  Max.   :52.00  
## 
## mining info:
##  data ntransactions support confidence
##    tr          9835   0.001        0.8
##                                                            call
##  apriori(data = tr, parameter = list(supp = 0.001, conf = 0.8))

rulesL <- rulesL[!is.redundant(rulesL)]


inspect(rulesL[1:10])

##      lhs                                       rhs                     support confidence    coverage      lift count
## [1]  {Apple MacBook Pro,                                                                                             
##       HP Black & Tri-color Ink,                                                                                      
##       HP Laptop,                                                                                                     
##       iMac}                                 => {Acer Aspire}       0.001016777  0.8333333 0.001220132 10.068591    10
## [2]  {Dell Desktop,                                                                                                  
##       iMac,                                                                                                          
##       Lenovo Desktop Computer,                                                                                       
##       Mackie CR Speakers}                   => {ViewSonic Monitor} 0.001118454  1.0000000 0.001118454  9.064516    11
## [3]  {Dell Desktop,                                                                                                  
##       Lenovo Desktop Computer,                                                                                       
##       Mackie CR Speakers}                   => {ViewSonic Monitor} 0.001220132  0.9230769 0.001321810  8.367246    12
## [4]  {Dell Desktop,                                                                                                  
##       Etekcity Power Extension Cord Cable,                                                                           
##       HP Laptop,                                                                                                     
##       iMac,                                                                                                          
##       Lenovo Desktop Computer}              => {ViewSonic Monitor} 0.001118454  0.9166667 0.001220132  8.309140    11
## [5]  {Acer Aspire,                                                                                                   
##       Apple Earpods,                                                                                                 
##       HP Laptop,                                                                                                     
##       HP Monitor}                           => {ViewSonic Monitor} 0.001016777  0.9090909 0.001118454  8.240469    10
## [6]  {HP Laptop,                                                                                                     
##       iMac,                                                                                                          
##       Lenovo Desktop Computer,                                                                                       
##       Mackie CR Speakers}                   => {ViewSonic Monitor} 0.001321810  0.8666667 0.001525165  7.855914    13
## [7]  {Acer Aspire,                                                                                                   
##       Dell Desktop,                                                                                                  
##       Epson Printer,                                                                                                 
##       HP Laptop}                            => {ViewSonic Monitor} 0.001220132  0.8571429 0.001423488  7.769585    12
## [8]  {Acer Aspire,                                                                                                   
##       ASUS Chromebook,                                                                                               
##       Dell Desktop,                                                                                                  
##       HP Laptop}                            => {ViewSonic Monitor} 0.001220132  0.8571429 0.001423488  7.769585    12
## [9]  {Dell Desktop,                                                                                                  
##       HP Laptop,                                                                                                     
##       iMac,                                                                                                          
##       Mackie CR Speakers}                   => {ViewSonic Monitor} 0.001321810  0.8125000 0.001626843  7.364919    13
## [10] {Acer Aspire,                                                                                                   
##       HP Laptop,                                                                                                     
##       Koss Home Headphones}                 => {ViewSonic Monitor} 0.001220132  0.8000000 0.001525165  7.251613    12

Interesting results, when we see it using “lift” it looks like the iMac doesn’t show as much as before for the rhs.

topRulesL <- rulesL[1:10]
plot(topRulesL, col = rainbow(4, start = 0.35, end = 0.8))

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(topRulesL, method="graph", control=list("items"),col = rainbow(4, start = 0.35, end = 0.8))

## Warning: Unknown control parameters:

## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

In this case, we have a similar story but with a different product, which is the ViewSonic Monitor. Still we can see that we have many products bought together with this monitor, this validates our previous thought of this being a B2B company with also B2C transactions. We’ll keep doing some research in order to get some more insight.

Let’s see our top 5 “Packages” bought with the iMac.

rules1 <- apriori(tr, parameter = list(supp=0.001, conf=0.05, minlen=2),
 appearance = list(default="lhs",rhs="iMac"),
 control = list(verbose=F))
 
 rules1<-sort(rules1, decreasing=TRUE, by="confidence")
 inspect(rules1[1:5])

##     lhs                                         rhs        support confidence    coverage     lift count
## [1] {Brother Printer,                                                                                   
##      Halter Acrylic Monitor Stand}           => {iMac} 0.001118454          1 0.001118454 3.904327    11
## [2] {ASUS Monitor,                                                                                      
##      Mackie CR Speakers,                                                                                
##      ViewSonic Monitor}                      => {iMac} 0.001016777          1 0.001016777 3.904327    10
## [3] {Apple Magic Keyboard,                                                                              
##      Rii LED Gaming Keyboard & Mouse Combo,                                                             
##      ViewSonic Monitor}                      => {iMac} 0.001728521          1 0.001728521 3.904327    17
## [4] {ASUS Monitor,                                                                                      
##      Koss Home Headphones,                                                                              
##      Microsoft Office Home and Student 2016} => {iMac} 0.001016777          1 0.001016777 3.904327    10
## [5] {ASUS 2 Monitor,                                                                                    
##      Dell Desktop,                                                                                      
##      Logitech Keyboard}                      => {iMac} 0.001016777          1 0.001016777 3.904327    10

It looks like most of them are related to business needs.

rules2 <- apriori(tr, parameter = list(supp=0.001, conf=0.15, minlen=2),
 appearance = list(default="rhs",lhs="iMac"),
 control = list(verbose=F))
 
 rules2<-sort(rules2, decreasing=TRUE, by="confidence")
 inspect(rules2)

##     lhs       rhs                        support    confidence coverage 
## [1] {iMac} => {HP Laptop}                0.07554652 0.2949583  0.2561261
## [2] {iMac} => {Lenovo Desktop Computer}  0.05876970 0.2294561  0.2561261
## [3] {iMac} => {CYBERPOWER Gamer Desktop} 0.05673615 0.2215165  0.2561261
## [4] {iMac} => {Dell Desktop}             0.05460092 0.2131798  0.2561261
## [5] {iMac} => {ViewSonic Monitor}        0.04941535 0.1929337  0.2561261
## [6] {iMac} => {Apple Earpods}            0.04026436 0.1572052  0.2561261
##     lift      count
## [1] 1.5195993 743  
## [2] 1.5499321 578  
## [3] 1.2043198 558  
## [4] 1.5907615 537  
## [5] 1.7488507 486  
## [6] 0.9015239 396

We can also see that the HP laptop is the most frequent product bought together with the iMac and some other products that looks to be more of a business purchase than a single customer one.

It can be easily illustrated in the following plot:

plot(rules2, method="graph", control=list("items"),col = rainbow(4, start = 0.35, end = 0.8))

## Warning: Unknown control parameters:

## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

Now that we have this information, let’s try and group the products into categories in order to see if Blackwell will benefit by acquiring Electronidex.

We previously did an analysis of the products categorized that Blackwell is currently selling and made a sales volume prediction using a Random Forest Regression Model of the new four different product types (PC, Laptops, Netbooks and Smartphones). We’ll be importing the data we have on the products types that Blackwell is currently selling + the sales volume prediction of the new ones and make a new analysis in order to see if the acquisition of Electrodinex will benefit us. For us to do this, we’ll first categorized the products that Electrodinex are currently selling and then keep up with our analysis.

table1 <- pdf_text("C:/Users/andre/OneDrive/Andres Marquez/UBIQUM/Project 3/Part 3/ElectronidexItems2017.pdf") %>%
  read_lines()

full_table['Category'] <- NA

full_table[1:10, ]['Category'] <- "Laptops"
full_table[11:19, ]['Category'] <- "Desktop"
full_table[20:29, ]['Category'] <- "Monitor"
full_table[30:39, ]['Category'] <- "Computer Mice"
full_table[40:48, ]['Category'] <- "Keyboard"
full_table[49:57, ]['Category'] <- "Mouse and Keyboard Combo"
full_table[58:67, ]['Category'] <- "Computer Headphones"
full_table[68:73, ]['Category'] <- "Active Headphones"
full_table[74:82, ]['Category'] <- "Computer Cords"
full_table[83:86, ]['Category'] <- "Accessories"
full_table[87:95, ]['Category'] <- "Speakers"
full_table[96:100, ]['Category'] <- "Printers"
full_table[101:105, ]['Category'] <- "Printer Ink"
full_table[106:110, ]['Category'] <- "Computer Stands"
full_table[111:115, ]['Category'] <- "Computer Tablets"
full_table[116:120, ]['Category'] <- "External Hardrives"
full_table[121:125, ]['Category'] <- "Smart Home Devices"

full_table <- full_table[order(full_table$Products), c(1,2)]

df_final <- data.frame(df, full_table)

Great, now we have our full table with the number of times the product was purchased + its category.

Now, we want to see which categories has the most number of products

library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

n_categories <- tabyl(df_final, Category)
n_categories

##                  Category  n percent
##               Accessories  4   0.032
##         Active Headphones  6   0.048
##            Computer Cords  9   0.072
##       Computer Headphones 10   0.080
##             Computer Mice 10   0.080
##           Computer Stands  5   0.040
##          Computer Tablets  5   0.040
##                   Desktop  9   0.072
##        External Hardrives  5   0.040
##                  Keyboard  9   0.072
##                   Laptops 10   0.080
##                   Monitor 10   0.080
##  Mouse and Keyboard Combo  9   0.072
##               Printer Ink  5   0.040
##                  Printers  5   0.040
##        Smart Home Devices  5   0.040
##                  Speakers  9   0.072

n_categories <- as.data.frame(n_categories)
n_categories <- n_categories[order(n_categories$n, decreasing = TRUE), c(1,2,3)]

p <-ggplot(data=n_categories, aes(x = reorder(Category,-n), y = n, fill=n)) +
  geom_bar(stat="identity") + 
  scale_fill_gradient(low="darkorange1",high="#003366", name="#")+
  labs(title="Number of Products per Category",
        x ="Category", y = "Count") +
  geom_text(aes(label=n), vjust=1.5, hjust=0.5 , color="white", size=3.5)+
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
p

The plot clearly shows the number of different products per category.

Now, let’s see the number of purchases per category made and see which groups stand out.

df_count <- aggregate(df_final$Count, list(df_final$Category), FUN=sum)

df_count <- df_count[order(df_count$x, decreasing = TRUE), c(1,2)]

sum(df_count$x)

## [1] 43104

p <-ggplot(data=df_count, aes(x = reorder(Group.1,x), y = x, fill=x)) +
  geom_bar(stat="identity") + 
  scale_fill_gradient(low="darkorange1",high="#003366", name="Count")+
  labs(title="Number of purchases per Category",
        x ="Category", y = "Count") +
  geom_text(aes(label=x), vjust=0.5, hjust=1.2 , color="white", size=3.5)+
  theme_minimal() +
  coord_flip()
p

We can see that Desktop, Laptops and Monitors are the products that were most bought.

Now, we’ll be importing the transactions of the products types that Blackwell is currently selling and join them with the new products with their respective sales volume prediction. This will give us a clear view on what segmentation Blackwell is failing and see if the acquisition of Electrodinex might help gain some ground that area.

ExistingProducts <- read.csv(("C:/Users/andre/OneDrive/Andres Marquez/UBIQUM/Project 3/Part 2/existingproducts.csv"))
NewProducts <- read.csv(("C:/Users/andre/OneDrive/Andres Marquez/UBIQUM/Project 3/Part 2/newproducts.csv"))

mean_volume <- aggregate(ExistingProducts$Volume, list(ExistingProducts$ProductType), FUN=mean)
mean_volume <- mean_volume[order(mean_volume$x, decreasing = TRUE), c(1,2)]
mean_volume

##             Group.1          x
## 4       GameConsole 4360.00000
## 3  ExtendedWarranty  987.60000
## 1       Accessories  982.61538
## 11         Software  711.33333
## 2           Display  485.60000
## 10       Smartphone  452.00000
## 12           Tablet  316.00000
## 5            Laptop  172.00000
## 8           Printer  169.66667
## 6           Netbook   46.00000
## 7                PC   29.00000
## 9   PrinterSupplies   14.66667

ExistingProducts <- ExistingProducts %>% select(ProductType,Price, Volume, ProfitMargin)

NewProducts <- NewProducts %>% select(ProductType,Price, Volume, ProfitMargin)

prueba <- rbind(ExistingProducts, NewProducts)

“Prueba” is now the complete dataset, “Existing Products” and “New Products” with their sales volume prediction that Blackwell is currently selling.

Let’s calculate the mean sales volume per category and see which ones stand out.

mean_volume <- aggregate(prueba$Volume, list(prueba$ProductType), FUN=mean)
mean_volume <- mean_volume[order(mean_volume$x, decreasing = TRUE), c(1,2)]
mean_volume$x <- round(mean_volume$x, 2)
mean_volume

##             Group.1       x
## 4       GameConsole 3628.34
## 12           Tablet 1596.81
## 1       Accessories  916.32
## 3  ExtendedWarranty  904.16
## 11         Software  626.08
## 2           Display  408.37
## 10       Smartphone  394.97
## 6           Netbook  261.23
## 8           Printer  167.35
## 5            Laptop  134.10
## 7                PC  129.95
## 9   PrinterSupplies   14.67

p <-ggplot(data=mean_volume, aes(x = reorder(Group.1,x), y = x, fill = x)) +
  coord_flip() +
  scale_fill_gradient(low="darkorange1",high="#003366", name="Amount")+
  geom_bar(stat="identity") +
  labs(title="Mean Sales Volume per Category",
        x ="Category", y = "Amount") +
  geom_text(aes(label=x), vjust=0.5, hjust=1.2 , color="white", size=3)+
  theme_minimal()
p

This tells us that the product type with the highest sales volume mean is Game Console, but lets keep diving into it, it doesn’t seem to be telling the full story.

Let’s create a pie chart with the sum of the sales volume

sales_volume <- aggregate(prueba$Volume, list(prueba$ProductType), FUN=sum)
sales_volume <- sales_volume[order(sales_volume$x, decreasing = TRUE), c(1,2)]
sales_volume$x <- round(sales_volume$x, 2)
sales_volume

##             Group.1        x
## 1       Accessories 25656.86
## 4       GameConsole 14513.37
## 3  ExtendedWarranty  9945.77
## 12           Tablet  7984.07
## 11         Software  4382.56
## 10       Smartphone  3159.76
## 2           Display  2450.21
## 8           Printer  2175.54
## 6           Netbook  1567.39
## 5            Laptop   804.58
## 7                PC   779.71
## 9   PrinterSupplies    58.67

sum(sales_volume$x)

## [1] 73478.49

p <-ggplot(data=sales_volume, aes(x = reorder(Group.1,x), y = x, fill = x)) +
  coord_flip() +
  scale_fill_gradient(low="darkorange1",high="#003366", name="Amount")+
  geom_bar(stat="identity") +
  labs(title="Sales Volume per Category",
        x ="Category", y = "Amount") +
  geom_text(aes(label=x), vjust=0.5, hjust=1.2 , color="white", size=3)+
  theme_minimal()
p

fig <- plot_ly(sales_volume, labels = ~Group.1, values = ~x, type = 'pie')
fig <- fig %>% layout(title = 'Sales Volume per Category',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig

Crazy turnaround, it looks like Game Console is not the one with the highest sales despite having the highest mean. Accessories is the product type that has the highest sales volume of Blackwell at the moment and it seems that PC (Desktop) and Laptops represent just a little portion based on our prediction of our sales, situation that could be heavily improven by the acquisition of Electrodinex, given that the two most purchased product types are Laptops and Desktops, followed by Monitors and other categories that can help the company take a bigger proportion of the market.

For last, let’s calculate the mean profit margin per category in order to see which are the ones more profitable and find out if those belong to the top selling product types of Electronidex.

margin_profit <- aggregate(prueba$ProfitMargin, list(prueba$ProductType), FUN=mean)
margin_profit <- margin_profit[order(margin_profit$x, decreasing = TRUE), c(1,2)]
margin_profit$x <- round(margin_profit$x, 2)
margin_profit

##             Group.1    x
## 3  ExtendedWarranty 0.40
## 9   PrinterSupplies 0.31
## 8           Printer 0.20
## 11         Software 0.19
## 7                PC 0.17
## 12           Tablet 0.16
## 2           Display 0.15
## 4       GameConsole 0.14
## 5            Laptop 0.13
## 10       Smartphone 0.12
## 6           Netbook 0.10
## 1       Accessories 0.05

p <-ggplot(data=margin_profit, aes(x = reorder(Group.1,x), y = x, fill=x)) +
  geom_bar(stat="identity") + 
  scale_fill_gradient(low="darkorange1",high="#003366", name="Profit Margin")+
  labs(title="Mean Profit Margin per Category",
        x ="Category", y = "Profit Margin") +
  geom_text(aes(label=x), vjust=0.5, hjust=1.2 , color="white", size=3.5)+
  theme_minimal() +
  coord_flip()
p

Clearly not the results we wanted, still we have one last thing we can try.

Given that we have the number of transactions made for each product type, we are going to calculate the sales volume of the categories that the acquisition of Electronidex would add to our company (for now, only the ones we share). For this calculation we’ll be using the mean price of our products types and multiply it by the count of purchases made to see an estimate of how much we are talking about.

mean_price <- aggregate(prueba$Price, list(prueba$ProductType), FUN=mean)
mean_price <- mean_price[order(mean_price$x, decreasing = TRUE), c(1,2)]
mean_price$x <- round(mean_price$x, 2)
mean_price

##             Group.1       x
## 5            Laptop 1109.60
## 7                PC  961.16
## 2           Display  481.75
## 12           Tablet  449.00
## 6           Netbook  371.33
## 4       GameConsole  276.00
## 8           Printer  216.02
## 10       Smartphone  211.75
## 11         Software  162.36
## 3  ExtendedWarranty  154.48
## 1       Accessories   56.83
## 9   PrinterSupplies   43.52

p <-ggplot(data=mean_price, aes(x = reorder(Group.1,-x), y = x, fill=x)) +
  geom_bar(stat="identity") + 
  scale_fill_gradient(low="darkorange1",high="#003366", name="Price")+
  labs(title="Mean Price per Category",
        x ="Category", y = "Price") +
  geom_text(aes(label=x), vjust=-0.5, hjust=0.5 , color="black", size=3.5)+
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
p

We can see that Laptops and PC are the ones with the highest price.

sv <- c("Category", "Count")
mp <- c("Category", "Price")
pm <- c("Category", "Profit_Margin")

colnames(mean_price) <- mp
colnames(df_count)<- sv
colnames(margin_profit) <- pm

Let’s change the names of the categories that mean the same in order to join them.

df_count[df_count$Category=="Laptops", "Category"] <- "Laptop"
df_count[df_count$Category=="Desktop", "Category"] <- "PC"
df_count[df_count$Category=="Monitor", "Category"] <- "Display"
df_count[df_count$Category=="Printers", "Category"] <- "Printer"
df_count[df_count$Category=="Printer Ink", "Category"] <- "PrinterSupplies"

intersect(df_count$Category, mean_price$Category)

## [1] "PC"              "Laptop"          "Display"         "Accessories"    
## [5] "Printer"         "PrinterSupplies"

We’ll be using SQL to make it simpler.

library(sqldf)

## Loading required package: gsubfn

## Loading required package: proto

## Loading required package: RSQLite

df2 = sqldf("
      SELECT mp.Category, mp.Price, sv.Count FROM mean_price mp JOIN df_count sv
      ON mp.Category=sv.Category
      ")

df2 = sqldf("
      SELECT Category, Price, Count, Price*Count as Sales
      FROM df2
      ")

p <-ggplot(data=df2, aes(x = reorder(Category, -Sales), y = Sales, fill=Sales)) +
  geom_bar(stat="identity") + 
  scale_fill_gradient(low="darkorange1",high="#003366", name="Sales")+
  labs(title="Sales per Category",
        x ="Category", y = "Sales") +
  geom_text(aes(label=Sales), vjust=-0.5, hjust=0.5 , color="black", size=3.5)+
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
p

The previous plot shows us the sales that could represent the categories in the chart.PC and Laptops are the ones that stand out the most.

sum(df2$Sales)

## [1] 20494015

The total of sales could be around 20.494.015 by just taking into account the categories we share, we are still missing some.

prueba = sqldf("
      SELECT ProductType, Price, Volume, ProfitMargin, Price*Volume as Sales
      FROM prueba
      ")

view(prueba)

sum(prueba$Sales)

## [1] 13757056

13.757.056 is the current number of sales that we currently have (taking into account the predictions we made about the new products)

sum(df2$Sales) - sum(prueba$Sales)

## [1] 6736959

Now we can definitely confirm that the acquisition of Electrodinex would be a great opportunity for Blackwell. Given a simple calculation of the mean price of certain products that we both share times the sales volume we can estimate sales of around 20.494.015, way bigger of what we currently manage at the moment which is 13.942.795 by our calculations taking into account the prediction we made about our new products. That’s 6.736.959 more in sales, just taking into account the products we already share together.

We could also calculate the profit we could be generating by the sale of this related items from Electrodinex.

df3 = sqldf("
      SELECT mp.Category, mp.Profit_Margin, sv.Sales FROM margin_profit mp JOIN df2 sv
      ON mp.Category=sv.Category
      ")

df3 = sqldf("
      SELECT Category, Profit_Margin, Sales, Profit_Margin*Sales as Profit FROM df3
      ")

sum(df3$Profit)

## [1] 3097882

Our profit would be of 3.097.882, what’s left is to look at the financial situation and evaluation of the company to see if it would be a smart decision to acquire it. But just looking at the information we have in our hands, it looks like it would be a smart idea for Blackwell to make the acquisition of Electrodinex.

Appendix

fig <- plot_ly(df3, labels = ~Category, values = ~Sales, type = 'pie')
fig <- fig %>% layout(title = 'Sales per Category',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig

Discovering Associations Between Products

Andres Marquez

2022-10-27

Appendix