Predicting volume sales for new stock items and calculating their profitability

Summary of the project

An exploratory analysis and machine learning process to predict the sales volume of new products, so that we can calculate the profitability of the new items introduced to the company portfolio.
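
A note on setup: the package-loading chunk is not shown in this document. Based on the functions called throughout, something like the following is assumed:

library(dplyr)         # glimpse(), %>%, filter(), select(), transmute()
library(corrplot)      # corrplot()
library(rpart)         # decision trees (see the Random Forest sections below)
library(rpart.plot)    # rpart.plot()
library(randomForest)  # randomForest()
library(caret)         # knnreg()
library(e1071)         # svm()
library(Metrics)       # rmse(), mae(), mdae()
library(knitr)         # kable()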

Looking at the structure of the data

glimpse(existing)
## Observations: 80
## Variables: 18
## $ Product.Type                     <fct> PC, PC, PC, Laptop, Laptop, A...
## $ Product..                        <int> 101, 102, 103, 104, 105, 106,...
## $ Price                            <dbl> 949.00, 2249.99, 399.00, 409....
## $ X5.Star.Reviews                  <int> 3, 2, 3, 49, 58, 83, 11, 33, ...
## $ X4.Star.Reviews                  <int> 3, 1, 0, 19, 31, 30, 3, 19, 9...
## $ X3.Star.Reviews                  <int> 2, 0, 0, 8, 11, 10, 0, 12, 2,...
## $ X2.Star.Reviews                  <int> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0,...
## $ X1.Star.Reviews                  <int> 0, 0, 0, 9, 36, 40, 1, 9, 2, ...
## $ Positive.Service.Review          <int> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2...
## $ Negative.Service.Review          <int> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0...
## $ Would.consumer.recommend.product <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3,...
## $ Best.Sellers.Rank                <int> 1967, 4806, 12076, 109, 268, ...
## $ Shipping.Weight..lbs.            <dbl> 25.80, 50.00, 17.40, 5.70, 7....
## $ Product.Depth                    <dbl> 23.94, 35.00, 10.50, 15.00, 1...
## $ Product.Width                    <dbl> 6.62, 31.75, 8.30, 9.90, 0.30...
## $ Product.Height                   <dbl> 16.89, 19.00, 10.20, 1.30, 8....
## $ Profit.margin                    <dbl> 0.15, 0.25, 0.08, 0.08, 0.09,...
## $ Volume                           <int> 12, 8, 12, 196, 232, 332, 44,...

List of “TO DO” things to start with:

  1. Renaming the attributes to something more comfortable to work with
  2. Converting ProductID to a factor & checking it is unique for each observation
  3. Converting Best.Sellers.Rank to a factor and studying its NA values
  4. Combining Depth, Width and Height to create a Surface attribute
  5. LATER: Applying the same transformations to the NEW dataset
names(existing) = c("type", "id", "price", "stars_5", "stars_4", "stars_3", "stars_2", "stars_1", "pos_service", "neg_service", "would_recommend", "bestSeller_rank", "weight", "depth", "width", "height", "profit_margin", "volume")

existing$id = factor(existing$id)
existing$bestSeller_rank = factor(existing$bestSeller_rank)

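# Note: "surface" is computed as the sum of the squared dimensions
# (depth^2 + height^2 + width^2), a single size proxy rather than a true surface area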
existing$surface = (existing$depth^2)+(existing$height^2)+(existing$width^2)

glimpse(existing)
## Observations: 80
## Variables: 19
## $ type            <fct> PC, PC, PC, Laptop, Laptop, Accessories, Acces...
## $ id              <fct> 101, 102, 103, 104, 105, 106, 107, 108, 109, 1...
## $ price           <dbl> 949.00, 2249.99, 399.00, 409.99, 1079.99, 114....
## $ stars_5         <int> 3, 2, 3, 49, 58, 83, 11, 33, 16, 10, 21, 75, 1...
## $ stars_4         <int> 3, 1, 0, 19, 31, 30, 3, 19, 9, 1, 2, 25, 8, 62...
## $ stars_3         <int> 2, 0, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5, 13, ...
## $ stars_2         <int> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3, 0, 8, 7, 2...
## $ stars_1         <int> 0, 0, 0, 9, 36, 40, 1, 9, 2, 0, 15, 3, 1, 16, ...
## $ pos_service     <int> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, 44, 57...
## $ neg_service     <int> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2, 0, 3, 3, ...
## $ would_recommend <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, 0.8, 0...
## $ bestSeller_rank <fct> 1967, 4806, 12076, 109, 268, 64, NA, 2, NA, 18...
## $ weight          <dbl> 25.80, 50.00, 17.40, 5.70, 7.00, 1.60, 7.30, 1...
## $ depth           <dbl> 23.94, 35.00, 10.50, 15.00, 12.90, 5.80, 6.70,...
## $ width           <dbl> 6.62, 31.75, 8.30, 9.90, 0.30, 4.00, 10.30, 6....
## $ height          <dbl> 16.89, 19.00, 10.20, 1.30, 8.90, 1.00, 11.50, ...
## $ profit_margin   <dbl> 0.15, 0.25, 0.08, 0.08, 0.09, 0.05, 0.05, 0.05...
## $ volume          <int> 12, 8, 12, 196, 232, 332, 44, 132, 64, 40, 84,...
## $ surface         <dbl> 902.2201, 2594.0625, 283.1800, 324.7000, 245.7...

Looking at the summary of the attributes

summary(existing)
##                 type          id         price            stars_5      
##  Accessories      :26   101    : 1   Min.   :   3.60   Min.   :   0.0  
##  Printer          :12   102    : 1   1st Qu.:  52.66   1st Qu.:  10.0  
##  Extended Warranty:10   103    : 1   Median : 132.72   Median :  50.0  
##  Software         : 6   104    : 1   Mean   : 247.25   Mean   : 176.2  
##  Display          : 5   105    : 1   3rd Qu.: 352.49   3rd Qu.: 306.5  
##  PC               : 4   106    : 1   Max.   :2249.99   Max.   :2801.0  
##  (Other)          :17   (Other):74                                     
##     stars_4          stars_3          stars_2          stars_1       
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :   0.00  
##  1st Qu.:  2.75   1st Qu.:  2.00   1st Qu.:  1.00   1st Qu.:   2.00  
##  Median : 22.00   Median :  7.00   Median :  3.00   Median :   8.50  
##  Mean   : 40.20   Mean   : 14.79   Mean   : 13.79   Mean   :  37.67  
##  3rd Qu.: 33.00   3rd Qu.: 11.25   3rd Qu.:  7.00   3rd Qu.:  15.25  
##  Max.   :431.00   Max.   :162.00   Max.   :370.00   Max.   :1654.00  
##                                                                      
##   pos_service      neg_service      would_recommend bestSeller_rank
##  Min.   :  0.00   Min.   :  0.000   Min.   :0.100   7      : 8     
##  1st Qu.:  2.00   1st Qu.:  1.000   1st Qu.:0.700   1      : 3     
##  Median :  5.50   Median :  3.000   Median :0.800   11     : 3     
##  Mean   : 51.75   Mean   :  6.225   Mean   :0.745   2      : 2     
##  3rd Qu.: 42.00   3rd Qu.:  6.250   3rd Qu.:0.900   3      : 2     
##  Max.   :536.00   Max.   :112.000   Max.   :1.000   (Other):47     
##                                                     NA's   :15     
##      weight            depth             width            height      
##  Min.   : 0.0100   Min.   :  0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.5125   1st Qu.:  4.775   1st Qu.: 1.750   1st Qu.: 0.400  
##  Median : 2.1000   Median :  7.950   Median : 6.800   Median : 3.950  
##  Mean   : 9.6681   Mean   : 14.425   Mean   : 7.819   Mean   : 6.259  
##  3rd Qu.:11.2050   3rd Qu.: 15.025   3rd Qu.:11.275   3rd Qu.:10.300  
##  Max.   :63.0000   Max.   :300.000   Max.   :31.750   Max.   :25.800  
##                                                                       
##  profit_margin        volume         surface        
##  Min.   :0.0500   Min.   :    0   Min.   :    0.00  
##  1st Qu.:0.0500   1st Qu.:   40   1st Qu.:   68.18  
##  Median :0.1200   Median :  200   Median :  175.10  
##  Mean   :0.1545   Mean   :  705   Mean   : 1606.03  
##  3rd Qu.:0.2000   3rd Qu.: 1226   3rd Qu.:  584.58  
##  Max.   :0.4000   Max.   :11204   Max.   :90000.50  
## 

Do all the products have unique IDs?

if (length(unique(existing$id)) == nrow(existing)) {
  print(paste("There are", length(unique(existing$id)), "unique ID values out of", nrow(existing), "observations."))
} else {
  print("CHECK: There is at least 1 repeated ID value in the 80 observations.")
}
## [1] "There are 80 unique ID values out of 80 observations."

Is the best sellers rank field meaningful?

barplot(table(existing$bestSeller_rank), main = "Amount of products\nby rank", col = "purple", xlab = "Rank", ylab = "Number of products")

sum(is.na(existing$bestSeller_rank))
## [1] 15

The rank value is not unique for each product, and the meaning of the attribute is not clear. Later on, we will decide whether it is worth keeping by applying a correlation matrix and a decision tree. Also, there are 15 NA values.

Price, Service review and User Recommendation histograms

par(mfrow = c(2,2))

hist(existing$price, col = "purple", xlab = "Price €", main = "Histogram of Price")
hist(existing$would_recommend, col = "purple", xlab = "Would recommend?\n0 No, 1 Yes", main = "Histogram of user recommend")
hist(existing$pos_service, col = "purple", xlab = "Positive review", main = "Histogram of positive service")
hist(existing$neg_service, col = "purple", xlab = "Negative review", main = "Histogram of negative service")

par(mfrow = c(1,1))
  • Most products cost up to 500 €, though some cost more than 1000 €
  • Most of the products are recommended by the users
  • Overall, there are more positive service reviews than negative ones

Studying the dependent variable: VOLUME

boxplot(existing$volume, col = "purple", varwidth = F)

We can see there are a couple of outliers. Let’s check which types of items they are.

existing %>%
  filter(volume >= 2500) %>%
  select(type, volume) %>%
  knitr::kable()
type          volume
Accessories    11204
Game Console    7036

An accessory and a game console are selling many units. We will not remove these outliers right now; when building the machine learning models to predict volume, we will compare the error metrics with and without the outliers to see what works better.
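
For that later comparison, an outlier-free copy of the data can be kept aside. A minimal sketch (the object name is hypothetical; the 2500-unit threshold matches the filter above):

# Hypothetical helper: a copy of the data without the two high-volume outliers
existing_no_outliers = existing %>% filter(volume < 2500)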

Correlation matrices

Correlation matrix ALL attributes

mat = as.matrix(existing[, -c(1:2, 12)])
corr_mat = cor(mat)
corrplot(corr = corr_mat, diag = F)

The attributes with the highest correlation with volume are the star reviews and positive service. Keep in mind that the correlation shown here is linear; any quadratic relationship will not appear in this matrix.

Also, it looks like there is collinearity between the star review attributes. Further investigation and decisions will help avoid bias in our models later on.

Correlation matrix Star reviews and Volume

Studying correlation and collinearity

mat_stars = as.matrix(existing[,c(4:8, 18)])
star_corr_mat = cor(mat_stars)
corrplot(corr = star_corr_mat, diag = F, method = "number")

Volume and 5-star reviews have a suspiciously high correlation. A correlation of 1 means that knowing the number of 5-star reviews of an item tells us exactly its sales volume, which does not make sense in the real world. It is true that the more positive reviews an item has, the more it tends to sell, but not to the point where its sales are determined 100% by its 5-star reviews. In short, there is something funny about the 5-star review values and we cannot use this data for later predictions.
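
A quick sanity check of that suspicious value, pulled directly from the data (the plot above indicates it is essentially 1):

# Correlation between 5-star reviews and volume
cor(existing$stars_5, existing$volume)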

For that reason, the 5-star reviews attribute will be removed completely, and some feature engineering will be carried out on the other star review attributes, as well as on other fields, like we did when creating the Surface attribute from Depth, Height and Width.

Feature engineering will allow us to avoid collinearity among attributes and improve both the correlations with volume and the later machine learning process.

Feature engineering

Let’s create a bunch of new attributes and see if they have a better correlation with “volume”.

existing = existing[, -c(2, 4, 13:16)] # Removing id, stars_5 and the raw dimension attributes (weight, depth, width, height)

existing$reviews_43 = existing$stars_4 + existing$stars_3
existing$reviews_21 = existing$stars_2 + existing$stars_1
existing$reviews_4321 = existing$stars_4 + existing$stars_3 + existing$stars_2 + existing$stars_1
existing$total_service_reviews = existing$pos_service + existing$neg_service
existing$total_positive_reviews = existing$stars_4 + existing$stars_3 + existing$pos_service
existing$total_negative_reviews = existing$stars_2 + existing$stars_1 + existing$neg_service

names(existing)
##  [1] "type"                   "price"                 
##  [3] "stars_4"                "stars_3"               
##  [5] "stars_2"                "stars_1"               
##  [7] "pos_service"            "neg_service"           
##  [9] "would_recommend"        "bestSeller_rank"       
## [11] "profit_margin"          "volume"                
## [13] "surface"                "reviews_43"            
## [15] "reviews_21"             "reviews_4321"          
## [17] "total_service_reviews"  "total_positive_reviews"
## [19] "total_negative_reviews"
  • reviews_43 = Sum of 4 star & 3 star attributes
  • reviews_21 = Sum of 2 star & 1 star attributes
  • reviews_4321 = Sum of 4, 3, 2 & 1 star attributes
  • total_service_reviews = Sum of positive & negative service attributes
  • total_positive_reviews = Sum of 4, 3 star reviews and positive service
  • total_negative_reviews = Sum of 2, 1 star reviews and negative service

Correlation matrix with newly created attributes

mat_stars = as.matrix(existing[, -c(1, 10)])
star_corr_mat = cor(mat_stars)
corrplot(corr = star_corr_mat, diag = F, method = "number", number.cex = 0.5)

Let’s study this correlation matrix:

There are now a lot of attributes with some degree of correlation with volume. We need to filter out which ones will be the best to work with for volume prediction.

The attributes most correlated with volume are stars_4, stars_3, pos_service, reviews_43, total_service_reviews and total_positive_reviews. What happens here is that we have a collinearity issue to fix.

stars_4 is highly correlated with volume, but also with stars_3, reviews_43 & total_positive_reviews. This means these variables “explain” the same information, and we do not want to keep all of them, to avoid biasing our models later. We want to retain as much of that information as possible, so we will keep the attribute that is most correlated with volume while not being highly correlated with the other attributes. In this case, that means we will stick with stars_4 and will not use stars_3, reviews_43 & total_positive_reviews.
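
For reference, the individual coefficients behind this decision can be pulled straight from the matrix computed above:

star_corr_mat["stars_4", "volume"]      # correlation of stars_4 with volume
star_corr_mat["stars_4", "reviews_43"]  # collinearity between stars_4 and reviews_43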

It must be said, though, that a correlation matrix only shows linear correlation between attributes: if there is some kind of quadratic relationship between volume and another attribute, it will not appear in the correlation matrix (the coefficient will be 0 or close to 0). In order to study relationships other than linear ones, we can use a Random Forest.
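
As a toy illustration of that blind spot, a perfect quadratic relationship can show zero linear correlation:

x = -10:10
cor(x, x^2)  # 0 (up to floating point): Pearson correlation misses the non-linear link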

Let’s build a Random Forest to see how much information each attribute contributes when predicting volume.

Attributes used are price + stars_4 + pos_service + neg_service + would_recommend + reviews_21
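
The fitting code for this model is not shown in the original document. A minimal sketch of how it could have been built, assuming rpart was used (rpart.plot() below only accepts rpart fits, so RF_some is presumably a single decision tree despite the name):

# Assumption: RF_some is an rpart regression tree
RF_some = rpart(volume ~ price + stars_4 + pos_service + neg_service +
                  would_recommend + reviews_21,
                data = existing)

# The importance vector lives directly on the fit; summary_RF_some is
# assumed to be an alias for the fitted object
summary_RF_some = RF_some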

summary_RF_some$variable.importance
##     stars_4  reviews_21 neg_service pos_service 
##    78458578    59323385    27749245    16270987
rpart.plot(RF_some, type = 5, shadow.col = "black")

The variable importance shown above ranks stars_4, reviews_21, neg_service & pos_service as the most important attributes. This mainly reflects how often each of these attributes is used to split the data within the model.

What really tells us about importance is the plot, which shows the attributes actually chosen for the main splits: stars_4 and pos_service.

Let’s compare this model (built with a subset of attributes) with one using all the original attributes plus the ones created in the feature engineering section.
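
Again, the fitting code is not shown; presumably an analogous call over all the remaining attributes, for example (excluding bestSeller_rank here is an assumption, given its NA values):

# Assumption: same rpart call, all attributes except bestSeller_rank
RF_all = rpart(volume ~ . - bestSeller_rank, data = existing)
summary_RF_all = RF_all  # assumed alias, as above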

summary_RF_all$variable.importance
##             reviews_43                stars_4           reviews_4321 
##               73672994               73672994               60280502 
##             reviews_21                stars_2                stars_3 
##               52623567               52623567               52623567 
##            pos_service total_positive_reviews  total_service_reviews 
##               16270987               14356753               14356753 
##                stars_1 total_negative_reviews 
##                8614052                8614052
rpart.plot(RF_all, type = 5, shadow.col = "black")

We can see that the variable importance ranks the same attributes as the previous Random Forest, and the plot shows the same attributes splitting the data.

Based on the correlation matrix and the Random Forest approach, our final decision is to select stars_4 and pos_service as the attributes for predicting volume.

existing_2attributes = existing %>%
  select(stars_4, pos_service, volume)

Machine Learning process. Predicting Volume sales.

We will train linear regression (LM), random forest (RF), k-nearest neighbours (KNN) and support vector machine (SVM) models, then decide which one works best and analyse its error.

Partitioning data

set.seed(123)

sample = sample.int(n = nrow(existing_2attributes),
                    size = floor(0.80*nrow(existing_2attributes)),
                    replace = F)

train = existing_2attributes[sample, ]
test = existing_2attributes[-sample, ]

Models

Linear Regression

lm_fit = lm(formula = volume~., data = train)
summary(lm_fit)
## 
## Call:
## lm(formula = volume ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1479.13   -80.99     5.07    41.03  3116.28 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.9281    85.1748  -0.210  0.83399    
## stars_4      12.8608     1.1887  10.819 8.05e-16 ***
## pos_service   2.3365     0.7266   3.216  0.00208 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 548.8 on 61 degrees of freedom
## Multiple R-squared:  0.7122, Adjusted R-squared:  0.7028 
## F-statistic: 75.49 on 2 and 61 DF,  p-value: < 2.2e-16

We can see that both independent variables are statistically significant, which means that changes in them are associated with changes in the dependent variable. The next step is predicting volume.

lm_pred = predict(object = lm_fit, newdata = test)
lm_pred
##          1         11         12         19         25         26 
##   25.32734   12.46651  324.62093  143.41126  397.11291 1546.33857 
##         28         29         50         53         55         61 
##   96.64093  -15.59163 6777.44571  657.70234  488.29680  418.16159 
##         62         71         79         80 
##  -17.92812  858.92427  393.57776 1168.72192

Those are the predicted sales volumes for the test products using the linear model. Note that a linear model can produce negative volume predictions (observations 29 and 62 above), which are not meaningful for sales. Let’s build the other models; then we will compare the error metrics in a matrix and proceed to the next step.

Random Forest

rf_fit = randomForest(volume~., data = train)
rf_pred = predict(object = rf_fit, newdata = test)

K-nearest neighbours

knn_fit = caret::knnreg(formula = volume~., data = train, k = 3)
knn_pred = predict(object = knn_fit, newdata = test)

Support Vector Machine

svm_fit = svm(formula = volume~., data = train)
svm_pred = predict(object = svm_fit, newdata = test)

Error metrics comparison

error.matrix = matrix(nrow = 4, ncol = 3,
                      dimnames = list(c("LM", "RF", "KNN", "SVM"),
                                      c("RMSE", "MAE", "MDAE")))

error.matrix[1,1] = rmse(actual = test$volume, predicted = lm_pred)
error.matrix[2,1] = rmse(actual = test$volume, predicted = rf_pred)
error.matrix[3,1] = rmse(actual = test$volume, predicted = knn_pred)
error.matrix[4,1] = rmse(actual = test$volume, predicted = svm_pred)

error.matrix[1,2] = mae(actual = test$volume, predicted = lm_pred)
error.matrix[2,2] = mae(actual = test$volume, predicted = rf_pred)
error.matrix[3,2] = mae(actual = test$volume, predicted = knn_pred)
error.matrix[4,2] = mae(actual = test$volume, predicted = svm_pred)

error.matrix[1,3] = mdae(actual = test$volume, predicted = lm_pred)
error.matrix[2,3] = mdae(actual = test$volume, predicted = rf_pred)
error.matrix[3,3] = mdae(actual = test$volume, predicted = knn_pred)
error.matrix[4,3] = mdae(actual = test$volume, predicted = svm_pred)
kable(error.matrix)
         RMSE      MAE      MDAE
LM   1197.945 549.3654 137.63725
RF   1841.609 633.0917  48.96955
KNN  2421.314 795.2125  48.66667
SVM  2516.003 841.4344  97.82619

In the end, the models are not very good predictors (then again, we only have 80 observations). The linear model achieves the lowest RMSE and MAE, while RF and KNN achieve the lowest median absolute error. Anyway, for business purposes we are interested in the items for which the prediction is accurate.

# transmute() selects, converts and renames in one step, so a separate select() is unnecessary
new_2attributes = new %>%
  transmute(stars_4 = as.integer(X.4.Star.Reviews.),
            pos_service = as.integer(X.Positive.Service.Review.))
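
As a preview of the continuation, volume predictions for the new items could then be generated along these lines (a sketch only; the profitability formula and the Price / Profit.margin column names in the NEW dataset are assumptions):

# Sketch: predict volume for the new items with the linear model
new_2attributes$pred_volume = predict(lm_fit, newdata = new_2attributes)

# Profitability per item could then be approximated as, e.g.:
# new$Price * new$Profit.margin * new_2attributes$pred_volume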

— TO BE CONTINUED —

David Gibert Bosque

February, 2019