Predicting volume sales for new stock items and calculating their profitability

Summary of the project

An exploratory analysis and machine learning process to predict the sales volume of new products, so that we can calculate the profitability of the new items introduced to the company portfolio.
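
A note on setup: the package-loading chunk is not shown in this document. Based on the functions called throughout, something like the following is assumed:

library(dplyr)         # glimpse(), %>%, filter(), select(), transmute()
library(corrplot)      # corrplot()
library(rpart)         # decision trees (see the Random Forest sections below)
library(rpart.plot)    # rpart.plot()
library(randomForest)  # randomForest()
library(caret)         # knnreg()
library(e1071)         # svm()
library(Metrics)       # rmse(), mae(), mdae()
library(knitr)         # kable()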

Looking at the structure of the data

glimpse(existing)
## Observations: 80
## Variables: 18
## $ Product.Type                     <fct> PC, PC, PC, Laptop, Laptop, A...
## $ Product..                        <int> 101, 102, 103, 104, 105, 106,...
## $ Price                            <dbl> 949.00, 2249.99, 399.00, 409....
## $ X5.Star.Reviews                  <int> 3, 2, 3, 49, 58, 83, 11, 33, ...
## $ X4.Star.Reviews                  <int> 3, 1, 0, 19, 31, 30, 3, 19, 9...
## $ X3.Star.Reviews                  <int> 2, 0, 0, 8, 11, 10, 0, 12, 2,...
## $ X2.Star.Reviews                  <int> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0,...
## $ X1.Star.Reviews                  <int> 0, 0, 0, 9, 36, 40, 1, 9, 2, ...
## $ Positive.Service.Review          <int> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2...
## $ Negative.Service.Review          <int> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0...
## $ Would.consumer.recommend.product <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3,...
## $ Best.Sellers.Rank                <int> 1967, 4806, 12076, 109, 268, ...
## $ Shipping.Weight..lbs.            <dbl> 25.80, 50.00, 17.40, 5.70, 7....
## $ Product.Depth                    <dbl> 23.94, 35.00, 10.50, 15.00, 1...
## $ Product.Width                    <dbl> 6.62, 31.75, 8.30, 9.90, 0.30...
## $ Product.Height                   <dbl> 16.89, 19.00, 10.20, 1.30, 8....
## $ Profit.margin                    <dbl> 0.15, 0.25, 0.08, 0.08, 0.09,...
## $ Volume                           <int> 12, 8, 12, 196, 232, 332, 44,...

List of “TO DO” things to start with:

  1. Renaming the attributes to something more comfortable to work with
  2. Converting ProductID to a factor & checking it is unique for each observation
  3. Converting Best.Sellers.Rank to a factor and studying its NA values
  4. Combining Depth, Width and Height to create a Surface attribute
  5. LATER: Applying the same transformations to the NEW dataset
names(existing) = c("type", "id", "price", "stars_5", "stars_4", "stars_3", "stars_2", "stars_1", "pos_service", "neg_service", "would_recommend", "bestSeller_rank", "weight", "depth", "width", "height", "profit_margin", "volume")

existing$id = factor(existing$id)
existing$bestSeller_rank = factor(existing$bestSeller_rank)

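# Note: "surface" is computed as the sum of the squared dimensions
# (depth^2 + height^2 + width^2), a single size proxy rather than a true surface area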
existing$surface = (existing$depth^2)+(existing$height^2)+(existing$width^2)

glimpse(existing)
## Observations: 80
## Variables: 19
## $ type            <fct> PC, PC, PC, Laptop, Laptop, Accessories, Acces...
## $ id              <fct> 101, 102, 103, 104, 105, 106, 107, 108, 109, 1...
## $ price           <dbl> 949.00, 2249.99, 399.00, 409.99, 1079.99, 114....
## $ stars_5         <int> 3, 2, 3, 49, 58, 83, 11, 33, 16, 10, 21, 75, 1...
## $ stars_4         <int> 3, 1, 0, 19, 31, 30, 3, 19, 9, 1, 2, 25, 8, 62...
## $ stars_3         <int> 2, 0, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5, 13, ...
## $ stars_2         <int> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3, 0, 8, 7, 2...
## $ stars_1         <int> 0, 0, 0, 9, 36, 40, 1, 9, 2, 0, 15, 3, 1, 16, ...
## $ pos_service     <int> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, 44, 57...
## $ neg_service     <int> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2, 0, 3, 3, ...
## $ would_recommend <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, 0.8, 0...
## $ bestSeller_rank <fct> 1967, 4806, 12076, 109, 268, 64, NA, 2, NA, 18...
## $ weight          <dbl> 25.80, 50.00, 17.40, 5.70, 7.00, 1.60, 7.30, 1...
## $ depth           <dbl> 23.94, 35.00, 10.50, 15.00, 12.90, 5.80, 6.70,...
## $ width           <dbl> 6.62, 31.75, 8.30, 9.90, 0.30, 4.00, 10.30, 6....
## $ height          <dbl> 16.89, 19.00, 10.20, 1.30, 8.90, 1.00, 11.50, ...
## $ profit_margin   <dbl> 0.15, 0.25, 0.08, 0.08, 0.09, 0.05, 0.05, 0.05...
## $ volume          <int> 12, 8, 12, 196, 232, 332, 44, 132, 64, 40, 84,...
## $ surface         <dbl> 902.2201, 2594.0625, 283.1800, 324.7000, 245.7...

Looking at the summary of the attributes

summary(existing)
##                 type          id         price            stars_5      
##  Accessories      :26   101    : 1   Min.   :   3.60   Min.   :   0.0  
##  Printer          :12   102    : 1   1st Qu.:  52.66   1st Qu.:  10.0  
##  Extended Warranty:10   103    : 1   Median : 132.72   Median :  50.0  
##  Software         : 6   104    : 1   Mean   : 247.25   Mean   : 176.2  
##  Display          : 5   105    : 1   3rd Qu.: 352.49   3rd Qu.: 306.5  
##  PC               : 4   106    : 1   Max.   :2249.99   Max.   :2801.0  
##  (Other)          :17   (Other):74                                     
##     stars_4          stars_3          stars_2          stars_1       
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :   0.00  
##  1st Qu.:  2.75   1st Qu.:  2.00   1st Qu.:  1.00   1st Qu.:   2.00  
##  Median : 22.00   Median :  7.00   Median :  3.00   Median :   8.50  
##  Mean   : 40.20   Mean   : 14.79   Mean   : 13.79   Mean   :  37.67  
##  3rd Qu.: 33.00   3rd Qu.: 11.25   3rd Qu.:  7.00   3rd Qu.:  15.25  
##  Max.   :431.00   Max.   :162.00   Max.   :370.00   Max.   :1654.00  
##                                                                      
##   pos_service      neg_service      would_recommend bestSeller_rank
##  Min.   :  0.00   Min.   :  0.000   Min.   :0.100   7      : 8     
##  1st Qu.:  2.00   1st Qu.:  1.000   1st Qu.:0.700   1      : 3     
##  Median :  5.50   Median :  3.000   Median :0.800   11     : 3     
##  Mean   : 51.75   Mean   :  6.225   Mean   :0.745   2      : 2     
##  3rd Qu.: 42.00   3rd Qu.:  6.250   3rd Qu.:0.900   3      : 2     
##  Max.   :536.00   Max.   :112.000   Max.   :1.000   (Other):47     
##                                                     NA's   :15     
##      weight            depth             width            height      
##  Min.   : 0.0100   Min.   :  0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.5125   1st Qu.:  4.775   1st Qu.: 1.750   1st Qu.: 0.400  
##  Median : 2.1000   Median :  7.950   Median : 6.800   Median : 3.950  
##  Mean   : 9.6681   Mean   : 14.425   Mean   : 7.819   Mean   : 6.259  
##  3rd Qu.:11.2050   3rd Qu.: 15.025   3rd Qu.:11.275   3rd Qu.:10.300  
##  Max.   :63.0000   Max.   :300.000   Max.   :31.750   Max.   :25.800  
##                                                                       
##  profit_margin        volume         surface        
##  Min.   :0.0500   Min.   :    0   Min.   :    0.00  
##  1st Qu.:0.0500   1st Qu.:   40   1st Qu.:   68.18  
##  Median :0.1200   Median :  200   Median :  175.10  
##  Mean   :0.1545   Mean   :  705   Mean   : 1606.03  
##  3rd Qu.:0.2000   3rd Qu.: 1226   3rd Qu.:  584.58  
##  Max.   :0.4000   Max.   :11204   Max.   :90000.50  
## 

Do all the products have unique IDs?

if (length(unique(existing$id)) == nrow(existing)) {
  print(paste("There are", length(unique(existing$id)), "unique ID values out of", nrow(existing), "observations."))
} else {
  print("CHECK: There is at least 1 repeated ID value in the 80 observations.")
}
## [1] "There are 80 unique ID values out of 80 observations."

Is the best sellers rank field meaningful?

barplot(table(existing$bestSeller_rank), main = "Amount of products\nby rank", col = "purple", xlab = "Rank", ylab = "Number of products")

sum(is.na(existing$bestSeller_rank))
## [1] 15

The rank value is not unique for each product, and the meaning of the attribute is not clear. Later on, we will decide whether it is worth keeping by applying a correlation matrix and a decision tree. Also, there are 15 NA values.

Price, Service review and User Recommendation histograms

par(mfrow = c(2,2))

hist(existing$price, col = "purple", xlab = "Price €", main = "Histogram of Price")
hist(existing$would_recommend, col = "purple", xlab = "Would recommend?\n0 No, 1 Yes", main = "Histogram of user recommend")
hist(existing$pos_service, col = "purple", xlab = "Positive review", main = "Histogram of positive service")
hist(existing$neg_service, col = "purple", xlab = "Negative review", main = "Histogram of negative service")

par(mfrow = c(1,1))
  • Most products cost up to 500 €, though some cost more than 1000 €
  • Most of the products are recommended by the users
  • Overall, there are more positive service reviews than negative ones

Studying the dependent variable: VOLUME

boxplot(existing$volume, col = "purple", varwidth = F)

We can see there are a couple of outliers. Let’s check which types of items they are.

existing %>%
  filter(volume >= 2500) %>%
  select(type, volume) %>%
  knitr::kable()
type          volume
Accessories    11204
Game Console    7036

An accessory and a game console are selling many units. We will not remove these outliers right now; when building the machine learning models to predict volume, we will compare the error metrics with and without the outliers to see what works better.
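
For that later comparison, an outlier-free copy of the data can be kept aside. A minimal sketch (the object name is hypothetical; the 2500-unit threshold matches the filter above):

# Hypothetical helper: a copy of the data without the two high-volume outliers
existing_no_outliers = existing %>% filter(volume < 2500)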

Correlation matrices

Correlation matrix ALL attributes

mat = as.matrix(existing[, -c(1:2, 12)])
corr_mat = cor(mat)
corrplot(corr = corr_mat, diag = F)

The attributes with the highest correlation with volume are the star reviews and positive service. Keep in mind that the correlation shown here is linear; any quadratic relationship will not appear in this matrix.

Also, it looks like there is collinearity between the star review attributes. Further investigation and decisions will help avoid bias in our models later on.

Correlation matrix Star reviews and Volume

Studying correlation and collinearity

mat_stars = as.matrix(existing[,c(4:8, 18)])
star_corr_mat = cor(mat_stars)
corrplot(corr = star_corr_mat, diag = F, method = "number")

Volume and 5-star reviews have a suspiciously high correlation. A correlation of 1 means that knowing the number of 5-star reviews of an item tells us exactly its sales volume, which does not make sense in the real world. It is true that the more positive reviews an item has, the more it tends to sell, but not to the point where its sales are determined 100% by its 5-star reviews. In short, there is something funny about the 5-star review values and we cannot use this data for later predictions.
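
A quick sanity check of that suspicious value, pulled directly from the data (the plot above indicates it is essentially 1):

# Correlation between 5-star reviews and volume
cor(existing$stars_5, existing$volume)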

For that reason, the 5-star reviews attribute will be removed completely, and some feature engineering will be carried out on the other star review attributes, as well as on other fields, like we did when creating the Surface attribute from Depth, Height and Width.

Feature engineering will allow us to avoid collinearity among attributes and improve both the correlations with volume and the later machine learning process.

Feature engineering

Let’s create a bunch of new attributes and see if they have a better correlation with “volume”.

existing = existing[, -c(2, 4, 13:16)] # Removing id, stars_5 and the raw dimension attributes (weight, depth, width, height)

existing$reviews_43 = existing$stars_4 + existing$stars_3
existing$reviews_21 = existing$stars_2 + existing$stars_1
existing$reviews_4321 = existing$stars_4 + existing$stars_3 + existing$stars_2 + existing$stars_1
existing$total_service_reviews = existing$pos_service + existing$neg_service
existing$total_positive_reviews = existing$stars_4 + existing$stars_3 + existing$pos_service
existing$total_negative_reviews = existing$stars_2 + existing$stars_1 + existing$neg_service

names(existing)
##  [1] "type"                   "price"                 
##  [3] "stars_4"                "stars_3"               
##  [5] "stars_2"                "stars_1"               
##  [7] "pos_service"            "neg_service"           
##  [9] "would_recommend"        "bestSeller_rank"       
## [11] "profit_margin"          "volume"                
## [13] "surface"                "reviews_43"            
## [15] "reviews_21"             "reviews_4321"          
## [17] "total_service_reviews"  "total_positive_reviews"
## [19] "total_negative_reviews"
  • reviews_43 = Sum of 4 star & 3 star attributes
  • reviews_21 = Sum of 2 star & 1 star attributes
  • reviews_4321 = Sum of 4, 3, 2 & 1 star attributes
  • total_service_reviews = Sum of positive & negative service attributes
  • total_positive_reviews = Sum of 4, 3 star reviews and positive service
  • total_negative_reviews = Sum of 2, 1 star reviews and negative service

Correlation matrix with newly created attributes

mat_stars = as.matrix(existing[, -c(1, 10)])
star_corr_mat = cor(mat_stars)
corrplot(corr = star_corr_mat, diag = F, method = "number", number.cex = 0.5)

Let’s study this correlation matrix:

There are now a lot of attributes with some degree of correlation with volume. We need to filter out which ones will be the best to work with for volume prediction.

The attributes most correlated with volume are stars_4, stars_3, pos_service, reviews_43, total_service_reviews and total_positive_reviews. What happens here is that we have a collinearity issue to fix.

stars_4 is highly correlated with volume, but also with stars_3, reviews_43 & total_positive_reviews. This means these variables “explain” the same information, and we do not want to keep all of them, to avoid biasing our models later. We want to retain as much of that information as possible, so we will keep the attribute that is most correlated with volume while not being highly correlated with the other attributes. In this case, that means we will stick with stars_4 and will not use stars_3, reviews_43 & total_positive_reviews.
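
For reference, the individual coefficients behind this decision can be pulled straight from the matrix computed above:

star_corr_mat["stars_4", "volume"]      # correlation of stars_4 with volume
star_corr_mat["stars_4", "reviews_43"]  # collinearity between stars_4 and reviews_43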

It must be said, though, that a correlation matrix only shows linear correlation between attributes: if there is some kind of quadratic relationship between volume and another attribute, it will not appear in the correlation matrix (the coefficient will be 0 or close to 0). In order to study relationships other than linear ones, we can use a Random Forest.
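
As a toy illustration of that blind spot, a perfect quadratic relationship can show zero linear correlation:

x = -10:10
cor(x, x^2)  # 0 (up to floating point): Pearson correlation misses the non-linear link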

Let’s build a Random Forest to see how much information each attribute contributes when predicting volume.

Attributes used are price + stars_4 + pos_service + neg_service + would_recommend + reviews_21
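
The fitting code for this model is not shown in the original document. A minimal sketch of how it could have been built, assuming rpart was used (rpart.plot() below only accepts rpart fits, so RF_some is presumably a single decision tree despite the name):

# Assumption: RF_some is an rpart regression tree
RF_some = rpart(volume ~ price + stars_4 + pos_service + neg_service +
                  would_recommend + reviews_21,
                data = existing)

# The importance vector lives directly on the fit; summary_RF_some is
# assumed to be an alias for the fitted object
summary_RF_some = RF_some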

summary_RF_some$variable.importance
##     stars_4  reviews_21 neg_service pos_service 
##    78458578    59323385    27749245    16270987
rpart.plot(RF_some, type = 5, shadow.col = "black")

The variable importance shown above ranks stars_4, reviews_21, neg_service & pos_service as the most important attributes. This mainly reflects how often each of these attributes is used to split the data within the model.

What really tells us about importance is the plot, which shows the attributes actually chosen for the main splits: stars_4 and pos_service.

Let’s compare this model (built with a subset of attributes) with one using all the original attributes plus the ones created in the feature engineering section.
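
Again, the fitting code is not shown; presumably an analogous call over all the remaining attributes, for example (excluding bestSeller_rank here is an assumption, given its NA values):

# Assumption: same rpart call, all attributes except bestSeller_rank
RF_all = rpart(volume ~ . - bestSeller_rank, data = existing)
summary_RF_all = RF_all  # assumed alias, as above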

summary_RF_all$variable.importance
##             reviews_43                stars_4           reviews_4321 
##               73672994               73672994               60280502 
##             reviews_21                stars_2                stars_3 
##               52623567               52623567               52623567 
##            pos_service total_positive_reviews  total_service_reviews 
##               16270987               14356753               14356753 
##                stars_1 total_negative_reviews 
##                8614052                8614052
rpart.plot(RF_all, type = 5, shadow.col = "black")

We can see that the variable importance ranks the same attributes as the previous Random Forest, and the plot shows the same attributes splitting the data.

Based on the correlation matrix and the Random Forest approach, our final decision is to select stars_4 and pos_service as the attributes for predicting volume.

existing_2attributes = existing %>%
  select(stars_4, pos_service, volume)

Machine Learning process. Predicting Volume sales.

We will train linear regression (LM), random forest (RF), k-nearest neighbours (KNN) and support vector machine (SVM) models, then decide which one works best and analyse its error.

Partitioning data

set.seed(123)

sample = sample.int(n = nrow(existing_2attributes),
                    size = floor(0.80*nrow(existing_2attributes)),
                    replace = F)

train = existing_2attributes[sample, ]
test = existing_2attributes[-sample, ]

Models

Linear Regression

lm_fit = lm(formula = volume~., data = train)
summary(lm_fit)
## 
## Call:
## lm(formula = volume ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1479.13   -80.99     5.07    41.03  3116.28 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.9281    85.1748  -0.210  0.83399    
## stars_4      12.8608     1.1887  10.819 8.05e-16 ***
## pos_service   2.3365     0.7266   3.216  0.00208 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 548.8 on 61 degrees of freedom
## Multiple R-squared:  0.7122, Adjusted R-squared:  0.7028 
## F-statistic: 75.49 on 2 and 61 DF,  p-value: < 2.2e-16

We can see that both independent variables are statistically significant, which means that changes in them are associated with changes in the dependent variable. The next step is predicting volume.

lm_pred = predict(object = lm_fit, newdata = test)
lm_pred
##          1         11         12         19         25         26 
##   25.32734   12.46651  324.62093  143.41126  397.11291 1546.33857 
##         28         29         50         53         55         61 
##   96.64093  -15.59163 6777.44571  657.70234  488.29680  418.16159 
##         62         71         79         80 
##  -17.92812  858.92427  393.57776 1168.72192

Those are the predicted sales volumes for the test products using the linear model. Note that a linear model can produce negative volume predictions (observations 29 and 62 above), which are not meaningful for sales. Let’s build the other models; then we will compare the error metrics in a matrix and proceed to the next step.

Random Forest

rf_fit = randomForest(volume~., data = train)
rf_pred = predict(object = rf_fit, newdata = test)

K-nearest neighbours

knn_fit = caret::knnreg(formula = volume~., data = train, k = 3)
knn_pred = predict(object = knn_fit, newdata = test)

Support Vector Machine

svm_fit = svm(formula = volume~., data = train)
svm_pred = predict(object = svm_fit, newdata = test)

Error metrics comparison

error.matrix = matrix(nrow = 4, ncol = 3,
                      dimnames = list(c("LM", "RF", "KNN", "SVM"),
                                      c("RMSE", "MAE", "MDAE")))

error.matrix[1,1] = rmse(actual = test$volume, predicted = lm_pred)
error.matrix[2,1] = rmse(actual = test$volume, predicted = rf_pred)
error.matrix[3,1] = rmse(actual = test$volume, predicted = knn_pred)
error.matrix[4,1] = rmse(actual = test$volume, predicted = svm_pred)

error.matrix[1,2] = mae(actual = test$volume, predicted = lm_pred)
error.matrix[2,2] = mae(actual = test$volume, predicted = rf_pred)
error.matrix[3,2] = mae(actual = test$volume, predicted = knn_pred)
error.matrix[4,2] = mae(actual = test$volume, predicted = svm_pred)

error.matrix[1,3] = mdae(actual = test$volume, predicted = lm_pred)
error.matrix[2,3] = mdae(actual = test$volume, predicted = rf_pred)
error.matrix[3,3] = mdae(actual = test$volume, predicted = knn_pred)
error.matrix[4,3] = mdae(actual = test$volume, predicted = svm_pred)
kable(error.matrix)
         RMSE      MAE      MDAE
LM   1197.945 549.3654 137.63725
RF   1841.609 633.0917  48.96955
KNN  2421.314 795.2125  48.66667
SVM  2516.003 841.4344  97.82619

In the end, the models are not very good predictors (then again, we only have 80 observations). The linear model achieves the lowest RMSE and MAE, while RF and KNN achieve the lowest median absolute error. Anyway, for business purposes we are interested in the items for which the prediction is accurate.

# transmute() selects, converts and renames in one step, so a separate select() is unnecessary
new_2attributes = new %>%
  transmute(stars_4 = as.integer(X.4.Star.Reviews.),
            pos_service = as.integer(X.Positive.Service.Review.))
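
As a preview of the continuation, volume predictions for the new items could then be generated along these lines (a sketch only; the profitability formula and the Price / Profit.margin column names in the NEW dataset are assumptions):

# Sketch: predict volume for the new items with the linear model
new_2attributes$pred_volume = predict(lm_fit, newdata = new_2attributes)

# Profitability per item could then be approximated as, e.g.:
# new$Price * new$Profit.margin * new_2attributes$pred_volume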

— TO BE CONTINUED —

David Gibert Bosque

February, 2019