Introduction

In this task I will perform a profitability analysis to select the most profitable products among the company’s new product releases, and then help decide whether it is a good idea to acquire a competitor based on its transactional data.

Load libraries

if ("pacman" %in% rownames(installed.packages()) == FALSE) {
  install.packages("pacman")
} else {
  pacman::p_load(lattice, ggplot2, caret, readr, corrplot, e1071, reshape2,
                 dplyr, plotly, arules, arulesViz, knitr, kableExtra)
}

Profitability analysis

Getting to know the data

We were provided with two datasets, Current_Products and New_Products, that share the same set of variables.

Here we can see a statistical description of each variable in the dataset.

summary(Current_Products)
##  ProductType          ProductNum        Price         x5StarReviews   
##  Length:80          Min.   :101.0   Min.   :   3.60   Min.   :   0.0  
##  Class :character   1st Qu.:120.8   1st Qu.:  52.66   1st Qu.:  10.0  
##  Mode  :character   Median :140.5   Median : 132.72   Median :  50.0  
##                     Mean   :142.6   Mean   : 247.25   Mean   : 176.2  
##                     3rd Qu.:160.2   3rd Qu.: 352.49   3rd Qu.: 306.5  
##                     Max.   :200.0   Max.   :2249.99   Max.   :2801.0  
##                                                                       
##  x4StarReviews    x3StarReviews    x2StarReviews    x1StarReviews    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :   0.00  
##  1st Qu.:  2.75   1st Qu.:  2.00   1st Qu.:  1.00   1st Qu.:   2.00  
##  Median : 22.00   Median :  7.00   Median :  3.00   Median :   8.50  
##  Mean   : 40.20   Mean   : 14.79   Mean   : 13.79   Mean   :  37.67  
##  3rd Qu.: 33.00   3rd Qu.: 11.25   3rd Qu.:  7.00   3rd Qu.:  15.25  
##  Max.   :431.00   Max.   :162.00   Max.   :370.00   Max.   :1654.00  
##                                                                      
##  PositiveServiceReview NegativeServiceReview Recommendproduct
##  Min.   :  0.00        Min.   :  0.000       Min.   :0.100   
##  1st Qu.:  2.00        1st Qu.:  1.000       1st Qu.:0.700   
##  Median :  5.50        Median :  3.000       Median :0.800   
##  Mean   : 51.75        Mean   :  6.225       Mean   :0.745   
##  3rd Qu.: 42.00        3rd Qu.:  6.250       3rd Qu.:0.900   
##  Max.   :536.00        Max.   :112.000       Max.   :1.000   
##                                                              
##  BestSellersRank ShippingWeight     ProductDepth      ProductWidth   
##  Min.   :    1   Min.   : 0.0100   Min.   :  0.000   Min.   : 0.000  
##  1st Qu.:    7   1st Qu.: 0.5125   1st Qu.:  4.775   1st Qu.: 1.750  
##  Median :   27   Median : 2.1000   Median :  7.950   Median : 6.800  
##  Mean   : 1126   Mean   : 9.6681   Mean   : 14.425   Mean   : 7.819  
##  3rd Qu.:  281   3rd Qu.:11.2050   3rd Qu.: 15.025   3rd Qu.:11.275  
##  Max.   :17502   Max.   :63.0000   Max.   :300.000   Max.   :31.750  
##  NA's   :15                                                          
##  ProductHeight     ProfitMargin        Volume     
##  Min.   : 0.000   Min.   :0.0500   Min.   :    0  
##  1st Qu.: 0.400   1st Qu.:0.0500   1st Qu.:   40  
##  Median : 3.950   Median :0.1200   Median :  200  
##  Mean   : 6.259   Mean   :0.1545   Mean   :  705  
##  3rd Qu.:10.300   3rd Qu.:0.2000   3rd Qu.: 1226  
##  Max.   :25.800   Max.   :0.4000   Max.   :11204  
## 

Search for duplicated rows

First of all I’m going to check whether the Current_Products dataset contains any duplicate entries. To do this I will only consider ProductType, the star reviews and the service reviews; I don’t want to consider Price, ProductNum, ProfitMargin or Volume because those variables might change over time for the same product.

And we found just that: 8 different ExtendedWarranty entries have different prices but the same number of reviews of each type and the same volume. This indicates that the same warranty has been sold at different prices, which makes sense since a warranty’s price can vary depending on which product it is applied to.

ReviewCols <- grep("Review|Type", names(Current_Products), value = TRUE)

idx <- duplicated(Current_Products[, c(ReviewCols)]) |
  duplicated(Current_Products[, c(ReviewCols)], fromLast = TRUE)

# Use a distinct name to avoid masking base::duplicated().
dup_rows <- Current_Products[idx, ]

duplicated[, c("ProductType", "Price", "x5StarReviews", "x4StarReviews", "PositiveServiceReview", "Volume")] %>%
  kable() %>%
  kable_styling()
ProductType Price x5StarReviews x4StarReviews PositiveServiceReview Volume
ExtendedWarranty 124.98 308 27 280 1232
ExtendedWarranty 129.98 308 27 280 1232
ExtendedWarranty 134.98 308 27 280 1232
ExtendedWarranty 151.98 308 27 280 1232
ExtendedWarranty 169.98 308 27 280 1232
ExtendedWarranty 179.98 308 27 280 1232
ExtendedWarranty 189.50 308 27 280 1232
ExtendedWarranty 349.99 308 27 280 1232

Since I don’t have information on which product each warranty is applied to, I can’t treat these duplicated rows as different products, so I will merge them into a single row with the mean price, add that row to the Current_Products dataset, and remove the 8 duplicated rows.

Merge the duplicated rows into a single row with the mean price.

dup_rows$Price <- mean(dup_rows$Price)
dup_rows <- dup_rows[1, ]

Remove the duplicated rows.

Current_Products <- Current_Products[!idx, ]

Add merged row.

Current_Products <- rbind(Current_Products, dup_rows)

The next step is to check how many missing values, if any, each variable has.

NAColumns <- colnames(Current_Products)[colSums(is.na(Current_Products)) > 0]

for (i in NAColumns) {
  print(paste(i, sum(is.na(Current_Products[, i]))))
}
## [1] "BestSellersRank 15"

The dataset has only one variable with missing values. Since that variable has many of them, is not a characteristic attribute of the product, and was never properly explained to us, I’m removing it.

Current_Products$BestSellersRank <- NULL

Dummify ProductType

Another thing I wanted to do was convert ProductType from a categorical variable into numerical ones. This is achieved by adding a column for each ProductType level and setting it to 1 or 0 depending on whether the given entry belongs to that ProductType.

This will be useful later on, since some steps require numerical variables and don’t work with categorical ones.

Current_Products$ProductType <- as.factor(Current_Products$ProductType)
DummyC <- dummyVars(" ~ .", data = Current_Products, levelsOnly = TRUE)
Current_Products_Dummy <- data.frame(predict(DummyC, newdata = Current_Products))

New_Products$ProductType <- as.factor(New_Products$ProductType)
DummyN <- dummyVars(" ~ .", data = New_Products, levelsOnly = TRUE)
New_Products_Dummy <- data.frame(predict(DummyN, newdata = New_Products))
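To see what this produced, we can peek at the first few columns of the dummified dataset; a small sketch, assuming the dummy columns come first (they do here, since ProductType was the first column and dummyVars preserves column order):

# Peek at the first dummy columns; one 0/1 indicator per ProductType level.
head(Current_Products_Dummy[, 1:4])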

Correlation matrix

Here we can see how strongly each variable correlates with the others. Since we are trying to predict Volume, the variables with the highest correlation to Volume will be the most relevant.

Since we split ProductType into one column per category, we can only see how each category correlates with the other variables, not ProductType as a whole.

As could be expected, the review features are the most correlated with Volume.

corrData <- cor(Current_Products_Dummy)
corrplot(corrData, type = "lower", tl.col = "black", tl.srt = 10, tl.cex = 0.8)

Feature selection

Star reviews have a high correlation with Volume but also with each other, so I will have to remove some of them in order to break that collinearity. I was planning to keep the 5 star and 1 star reviews, as those are the extremes, but 5 star reviews have a perfect correlation with Volume (so they would trivially leak the target), and 2 star reviews correlate more strongly with Volume than 1 star reviews, so instead I will keep the 4 star and 2 star reviews.
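As a quick numeric check of these claims, we can pull the star-review correlations out of the corrData matrix computed above:

# Correlation of each star-review count with Volume, strongest first.
sort(corrData[grep("StarReviews", rownames(corrData)), "Volume"], decreasing = TRUE)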

ProfitMargin is removed because it is not a value that can be perceived by the final client.

ProductNum is removed since it is just an index and is not perceived by the final client either.

Current_Products_Dummy <- subset(Current_Products_Dummy,
                                 select = -c(x5StarReviews, x3StarReviews, x1StarReviews,
                                             ProductNum, ProfitMargin))

New_Products_Dummy <- subset(New_Products_Dummy,
                             select = -c(x5StarReviews, x3StarReviews, x1StarReviews,
                                         ProductNum, ProfitMargin))

Check outliers

Another step is to check for outliers in the variable that we want to predict, in this case, Volume.

outliers <- boxplot(Current_Products_Dummy$Volume)$out

outliers
## [1]  2052  2140 11204  1896  7036  1684

We found some outliers. I’m going to save the rows containing them so I can add them back later for the descriptive analysis of the dataset.

Current_Products_Dummy_Out <-
  Current_Products_Dummy[which(Current_Products_Dummy$Volume %in% outliers), ]

For the modeling part I will remove all the rows with outliers; this helps the predictive model by narrowing the range of the target variable.

Current_Products_Dummy <-
  Current_Products_Dummy[-which(Current_Products_Dummy$Volume %in% outliers), ]

Since I removed some rows, it’s better to check whether in the process any variable was left with zero variance.

names(which(apply(Current_Products_Dummy, 2, var) == 0))
## [1] "GameConsole"

And indeed it happened: all GameConsole observations were outliers, so after removing them this variable contains no variance. I could add the GameConsole observations back, but I will remove the variable instead.

Current_Products_Dummy$GameConsole <- NULL
Current_Products_Dummy_Out$GameConsole <- NULL

Modeling Volume prediction

Define Train and Test sets

For the modeling part I will split the Current_Products dataset in two: 75% of it to train the predictive model and 25% to test how well the model performs. Then I will repeat the train and test steps with different models, use their error metrics to choose the best performing model, and make predictions on the New_Products dataset to select the most profitable products.

set.seed(123);inTraining <-
  createDataPartition(Current_Products_Dummy$Volume, p = .75, list = FALSE)

training <- Current_Products_Dummy[inTraining, ]
testing <- Current_Products_Dummy[-inTraining, ]

Since we don’t have many observations, I will add a cross-validation step to make the error metrics more robust.

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

Random forest

set.seed(123);rfFit1 <-
  train(Volume ~., data = training, method = "rf", trControl = fitControl, metric = "MAE")

RF_Train <- round(postResample(predict(rfFit1, training), training$Volume), 2)

PredictionsrfFit1 <- predict(rfFit1, testing)
RF_Test <- round(postResample(PredictionsrfFit1, testing$Volume), 2)

SVM

set.seed(123);svmFit1 <- svm(Volume ~., data = training, fitted = TRUE, kernel = "linear")

SVM_Train <- round(postResample(predict(svmFit1, training), training$Volume), 2)

PredictionssvmFit1 <- predict(svmFit1, testing)
SVM_Test <- round(postResample(PredictionssvmFit1, testing$Volume), 2)

KNN

set.seed(123);knnFit1 <-
  train(Volume ~., data = training, method = "knn", trControl = fitControl)

KNN_Train <- round(postResample(predict(knnFit1, training), training$Volume), 2)

PredictionsknnFit1 <- predict(knnFit1, testing)
KNN_Test <- round(postResample(PredictionsknnFit1, testing$Volume), 2)

Error metrics table

Here we have the error metrics of each model. Although the random forest model shows a larger drop in performance between the train and test sets, it has the best metrics overall, so even though it overfits it still performs better than the models that generalize well.

metrics <- data.frame(matrix(ncol = 5, nrow = 0))
colnames(metrics) <- c("RMSE", "Rsquared", "MAE", "Model", "Set")

metrics[nrow(metrics) + 1, ] <- c(RF_Train, "RF", "Train")
metrics[nrow(metrics) + 1, ] <- c(RF_Test, "RF", "Test")
metrics[nrow(metrics) + 1, ] <- c(SVM_Train, "SVM", "Train")
metrics[nrow(metrics) + 1, ] <- c(SVM_Test, "SVM", "Test")
metrics[nrow(metrics) + 1, ] <- c(KNN_Train, "KNN", "Train")
metrics[nrow(metrics) + 1, ] <- c(KNN_Test, "KNN", "Test")

metrics
##     RMSE Rsquared    MAE Model   Set
## 1  86.51     0.97  44.29    RF Train
## 2 142.12     0.88  82.12    RF  Test
## 3 277.71     0.66 129.47   SVM Train
## 4 141.13     0.92 113.43   SVM  Test
## 5 291.59     0.64 175.72   KNN Train
## 6 239.45     0.73 185.28   KNN  Test
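The same choice can be made programmatically; a small sketch that picks the model with the lowest test-set MAE (the columns are stored as character after filling the data frame row by row, hence the as.numeric):

# Select the winning model by test-set MAE.
test_metrics <- subset(metrics, Set == "Test")
test_metrics$Model[which.min(as.numeric(test_metrics$MAE))]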

Now I will use the chosen model to make predictions on the New_Products dataset.

NewPredictionsrfFit1 <- predict(rfFit1, New_Products_Dummy)

Add the predicted Volume to New_Products.

New_Products["Volume"] <- round(NewpPredictionsrfFit1, 0)

Calculate predicted Profit.

New_Products["Profit"] <-
  round(New_Products$ProfitMargin * New_Products$Volume * New_Products$Price, 0)

And here we have the top 10 most profitable products. By the numbers, it would be a good idea to select the top 3 products, since they are from different ProductTypes and won’t compete against each other. The fourth one is from the same category as the third, and as already mentioned it wouldn’t be a good idea to release more than one product of the same ProductType; the fifth one provides less than half the profit of the third, so we stop there.

New_Products_T <- subset(New_Products, select = c(ProductType, ProductNum, Price,
                                                  ProfitMargin, Volume, Profit))
New_Products_T <- New_Products_T[order(New_Products_T$Profit, decreasing = TRUE), ]

New_Products_T[1:10, ] %>%
  kable() %>%
  kable_styling()
ProductType ProductNum Price ProfitMargin Volume Profit
GameConsole 307 425.00 0.18 1153 88204
PC 171 699.00 0.25 443 77414
Tablet 186 629.00 0.10 1068 67177
Tablet 187 199.00 0.20 1137 45253
Netbook 180 329.00 0.09 1004 29728
GameConsole 199 249.99 0.09 1175 26436
PC 172 860.00 0.20 119 20468
Laptop 173 1199.00 0.10 165 19784
Printer 304 199.99 0.90 80 14399
Laptop 175 1199.00 0.15 43 7734
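The same selection logic can be expressed in code; a sketch (assuming dplyr >= 1.0 for slice_max) that keeps the most profitable product within each ProductType and then takes the top 3 overall:

# Best product per ProductType, then the top 3 overall; by construction
# no two of the selected products share a ProductType.
New_Products_T %>%
  group_by(ProductType) %>%
  slice_max(Profit, n = 1) %>%
  ungroup() %>%
  slice_max(Profit, n = 3)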

Here we can see the predicted Profit by ProductType; GameConsole, Tablet and PC are the most profitable ones by a wide margin.

ggplot(New_Products, aes(x = ProductType, y = Profit, fill = ProductType)) +
  geom_bar(stat = "identity") +
  labs(fill = "Product Type", title = "Total Predicted Profit Per Product Type") +
  scale_x_discrete(name = "Product Type") +
  scale_y_continuous(name = "Profit in Eur", breaks = seq(0, 120000, 20000)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

And here the predicted Volume by ProductType: GameConsole and Tablet are the clear leaders, followed by Netbook and Smartphone, although these last two ProductTypes didn’t provide much profit since they have small margins. Conversely, although PC provided a lot of profit, it didn’t sell that well.

ggplot(New_Products, aes(x = ProductType, y = Volume, fill = ProductType)) +
  geom_bar(stat = "identity") +
  labs(fill = "Product Type", title = "Total Predicted Volume Per Product Type") +
  scale_x_discrete(name = "Product Type") +
  scale_y_continuous(name = "Volume", breaks = seq(0, 4500, 500)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Descriptive analysis

As mentioned before, I’m going to add the outliers back, since it’s better to have all the data available for the descriptive plots.

Current_Products_Dummy <- rbind(Current_Products_Dummy, Current_Products_Dummy_Out)

Here we can see the proportion of star reviews received by each ProductType. Generally speaking, 5 star reviews are the most common in every ProductType by a large margin, while the other ratings make up only a small part of the total. Although some ProductTypes have a lower percentage of 5 star reviews, the insight worth noticing is that Software receives more 1 star reviews than any other kind of review; those products are not well received.

Current_Products_Star <- Current_Products[, c("ProductType", "x5StarReviews",
                                              "x4StarReviews", "x3StarReviews",
                                              "x2StarReviews", "x1StarReviews")] %>%
  melt(id.vars = "ProductType") %>%
  group_by(ProductType, variable) %>% 
  summarise(value = sum(value)) %>%
  mutate(freq = value / sum(value))

ggplot(Current_Products_Star, aes(ProductType, freq)) +
  geom_bar(aes(fill = variable), position = "dodge", stat = "identity") +
  labs(fill = "Star Reviews", title = "Star Reviews Per Product Type") +
  scale_x_discrete(name = "Product Type") +
  scale_y_continuous(name = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

And here the proportion of service reviews received by each ProductType. Again, generally speaking, most ProductTypes have more positive than negative service reviews, with Accessories and ExtendedWarranty receiving the most positive ones. Laptop, Netbook and Printer receive the most negative reviews; in fact they have more negative than positive reviews, which speaks very badly of the service behind these products.

Current_Products_Review <- Current_Products[, c("ProductType", "PositiveServiceReview",
                                                "NegativeServiceReview")] %>%
  melt(id.vars = "ProductType") %>%
  group_by(ProductType, variable) %>%
  summarise(value = sum(value)) %>%
  mutate(freq = value / sum(value))

ggplot(Current_Products_Review, aes(ProductType, freq)) +
  geom_bar(aes(fill = variable), position = "dodge", stat = "identity") +
  labs(fill = "Service Reviews", title = "Service Reviews Per Product Type") +
  scale_x_discrete(name = "Product Type") +
  scale_y_continuous(name = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

And here, to put the two previous plots in context, the total Volume sold by ProductType. Accessories is the most sold by a wide margin. It is a bit surprising to see Software as the third best selling ProductType given how many 1 star reviews it has, as we saw before. Without market knowledge we can only speculate about why this happens, but it might be due to a captive audience or a monopoly situation with a given product.

Current_Products_Sales <- subset(Current_Products, select = c(Volume, ProductType))

Current_Products_Sales <-
  aggregate(Current_Products_Sales$Volume,
            by = list(Category = Current_Products_Sales$ProductType), FUN = sum)

ggplot(Current_Products_Sales, aes(x = reorder(Category, -x), y = x)) +
  geom_bar(stat = "identity", fill = "tomato1", colour = "black") +
  labs(fill = "Product Type", title = "Blackwell Sales by Product Type") +
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "", breaks = seq(0, 25000, 5000)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.grid  = element_blank(),
        panel.background = element_rect(fill = "transparent"),
        axis.ticks = element_blank())

Market basket analysis of competitor data

And finally I will proceed with the market basket analysis of the transactional data provided by the competitor, in order to see whether it is a good idea to acquire this company.

Getting to know the data

This data is not in the form of a table like in the previous case; it is a transactional dataset, containing information about which products are in each purchase. We can see that there are 9,835 transactions and a total of 125 different products. We can also see the most sold items and the transaction sizes, but I will explore them further with more visual plots next.

summary(Transactions)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  125 columns (items) and a density of 0.03506172 
## 
## most frequent items:
##                     iMac                HP Laptop CYBERPOWER Gamer Desktop 
##                     2519                     1909                     1809 
##            Apple Earpods        Apple MacBook Air                  (Other) 
##                     1715                     1530                    33622 
## 
## element (itemset/transaction) length distribution:
## sizes
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##    2 2163 1647 1294 1021  856  646  540  439  353  247  171  119   77   72 
##   15   16   17   18   19   20   21   22   23   25   26   27   29   30 
##   56   41   26   20   10   10   10    5    3    1    1    3    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   4.383   6.000  30.000 
## 
## includes extended item information - examples:
##                             labels
## 1 1TB Portable External Hard Drive
## 2 2TB Portable External Hard Drive
## 3                   3-Button Mouse
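To make the sparse format more concrete, we can peek at a few raw transactions with arules::inspect:

# Show the item sets of the first 3 transactions.
inspect(Transactions[1:3])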

Plot transaction size

Here we can see more clearly how the number of transactions decays as the number of items increases. As expected, there are more transactions with few products than with many, but there are still quite a lot of large transactions, which might indicate that its clients are retailers.

transactionSize <- size(Transactions)

h <- hist(transactionSize, col = "turquoise3", breaks = 30, xaxt = "n", ylim = c(0, 2500),
          xlab = "Transaction Size", main = "Transaction Size Histogram", las = 1)
axis(1, at = seq(0, 30, by = 1))
text(h$mids, h$counts, labels = h$counts, adj = c(0, -0.5), srt = 45)

Plot item frequency

And here are the top 10 most sold items of that company. For the most part those items are different kinds of computers, rather than accessories or peripherals like those of our initial company.

itemFrequencyPlot(Transactions, type = c("absolute"), topN = 10, main = "Most Sold Items",
                  ylab = "", col = "turquoise3", yaxt = "n")
axis(2, at = seq(0, 2500, by = 500), tick = FALSE, las = 2, line = -1)

Generate association rules for items

Now I’m going to generate association rules for those items; this will tell me whether there are items that are bought together more frequently than separately.

Item_Rules <- apriori(Transactions, parameter = list(supp = 0.004, conf = 0.65, minlen = 2))

Item_Rules <- sort(subset(Item_Rules, subset = lift > 2.7), by = "lift")

Remove redundant rules if any.

Item_Rules <- Item_Rules[!is.redundant(Item_Rules)]
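Before plotting, it is worth looking at the strongest surviving rules in text form:

# Print the top rules by lift in readable form.
inspect(head(Item_Rules, 5))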

Plot item rules

Here I filtered out most of the rules to keep just a few relevant ones. We can see that many combinations of different computers and monitors lead to buying either an HP Laptop or an iMac.

plot(Item_Rules, method = "graph", control = list(type = "items"), shading = "confidence")

plot(Item_Rules, method = "paracoord", control = list(reorder = TRUE))

So, putting it together: we saw that there are more transactions than expected with many different products in them, that the most sold products are different kinds of computers, and that those products are frequently bought together. At this point our initial guess that this company might be oriented at retailers rather than final users looks accurate.

Add ProductType

I add the ProductType category to the transactional data in order to repeat the previous item-level analysis at the ProductType level.

Items_Sheet <- Items_Sheet[order(Items_Sheet$Item), ]

Transactions_ProductType <- aggregate(Transactions, Items_Sheet$Category)
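A quick sanity check for this step (a sketch; aggregate() matches the grouping vector to items by position, so the sorted Item column must line up exactly with the transaction item labels):

# The sorted item names should match the item labels one-to-one.
stopifnot(all(Items_Sheet$Item == itemLabels(Transactions)))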

Plot ProductType frequency

Here we can see the most sold categories. This plot can be compared with the one from our initial analysis of the other company, and we can clearly see that the two companies complement each other very well.

itemFrequencyPlot(Transactions_ProductType, type = c("absolute"), topN = 17,
                  main = "Electronidex Sales by Product Type", ylab = "",
                  col = "turquoise3", yaxt = "n")
axis(2, at = seq(0, 6000, by = 1000), tick = FALSE, las = 2, line = -1)

Generate association rules for product types

Next I will generate the association rules for those categories.

ProductType_Rules <- apriori(Transactions_ProductType,
                             parameter = list(supp = 0.08, conf = 0.5, minlen = 2))

Remove redundant rules if any.

ProductType_Rules <- ProductType_Rules[!is.redundant(ProductType_Rules)]

Plot product type rules

Here, since the different ProductTypes encompass many products, I opted to keep more rules and look at the big picture, complementing the precise item-level information from the previous steps.

Monitors, Laptops and Desktops are at the center of these plots, confirming that what we already discovered is not due to a specific product being popular but to the whole ProductType.

plot(ProductType_Rules, method = "graph", control = list(type = "items"))

plot(ProductType_Rules, method = "paracoord",
     control = list(reorder = TRUE), main = "Frequently sold together")

Conclusion

At the beginning I mentioned that this analysis had two goals, so I’m going to comment on each of them separately.

Profitability of new products

I created a predictive model based on the current product range in order to predict the sales volume of new products and choose the most profitable ones to put on sale.

The model performed well on our train and test steps, so we can trust its predictions and choose the most profitable products that won’t compete against each other and that complement the current product range.

ProductType ProductNum Price ProfitMargin Volume Profit
GameConsole 307 425 0.18 1153 88204
PC 171 699 0.25 443 77414
Tablet 186 629 0.10 1068 67177
Tablet 187 199 0.20 1137 45253
Netbook 180 329 0.09 1004 29728

Market basket analysis of a competitor

Here I analysed the competitor’s transactional data in order to see whether it made sense to acquire the company, and my conclusion is that it does. The two companies’ product ranges complement each other very well: one is good at selling computers and the other at selling computer accessories and peripherals.