Predicting volume sales for new stock items and calculating their profitability
- Summary of the project
- Looking at the summary of the attributes
- Price, Service review and User Recommendation histograms
- Studying the dependent variable: VOLUME
- Correlation matrices
- Correlation matrix Star reviews and Volume
- Feature engineering
- Correlation matrix with newly created attributes
- Let’s build a Random Forest in order to see how the attributes distribute the information for the variable volume
- Machine Learning process. Predicting Volume sales
- Partitioning data
- Models
- Error metrics comparison
Summary of the project
This report walks through an exploratory analysis and a machine learning process to predict the sales volume of new products, so that we can calculate the profitability of the new items introduced to the company portfolio.
Looking at the structure of the data
glimpse(existing)
## Observations: 80
## Variables: 18
## $ Product.Type <fct> PC, PC, PC, Laptop, Laptop, A...
## $ Product.. <int> 101, 102, 103, 104, 105, 106,...
## $ Price <dbl> 949.00, 2249.99, 399.00, 409....
## $ X5.Star.Reviews <int> 3, 2, 3, 49, 58, 83, 11, 33, ...
## $ X4.Star.Reviews <int> 3, 1, 0, 19, 31, 30, 3, 19, 9...
## $ X3.Star.Reviews <int> 2, 0, 0, 8, 11, 10, 0, 12, 2,...
## $ X2.Star.Reviews <int> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0,...
## $ X1.Star.Reviews <int> 0, 0, 0, 9, 36, 40, 1, 9, 2, ...
## $ Positive.Service.Review <int> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2...
## $ Negative.Service.Review <int> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0...
## $ Would.consumer.recommend.product <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3,...
## $ Best.Sellers.Rank <int> 1967, 4806, 12076, 109, 268, ...
## $ Shipping.Weight..lbs. <dbl> 25.80, 50.00, 17.40, 5.70, 7....
## $ Product.Depth <dbl> 23.94, 35.00, 10.50, 15.00, 1...
## $ Product.Width <dbl> 6.62, 31.75, 8.30, 9.90, 0.30...
## $ Product.Height <dbl> 16.89, 19.00, 10.20, 1.30, 8....
## $ Profit.margin <dbl> 0.15, 0.25, 0.08, 0.08, 0.09,...
## $ Volume <int> 12, 8, 12, 196, 232, 332, 44,...
List of “TO DO” things to start with:
- Renaming the attributes to something more comfortable
- Converting ProductID to factor & checking it is unique for each observation
- Converting Best.Sellers.Rank to factor and studying its NA values
- Combining Depth, Width and Height to create Surface
- LATER: Applying the same transformations to the NEW dataset
names(existing) = c("type", "id", "price", "stars_5", "stars_4", "stars_3",
                    "stars_2", "stars_1", "pos_service", "neg_service",
                    "would_recommend", "bestSeller_rank", "weight", "depth",
                    "width", "height", "profit_margin", "volume")
existing$id = factor(existing$id)
existing$bestSeller_rank = factor(existing$bestSeller_rank)
# Note: "surface" is defined here as the sum of the squared dimensions, not the
# geometric surface area of a box (which would be 2*(depth*width + depth*height + width*height))
existing$surface = (existing$depth^2) + (existing$height^2) + (existing$width^2)
glimpse(existing)
## Observations: 80
## Variables: 19
## $ type <fct> PC, PC, PC, Laptop, Laptop, Accessories, Acces...
## $ id <fct> 101, 102, 103, 104, 105, 106, 107, 108, 109, 1...
## $ price <dbl> 949.00, 2249.99, 399.00, 409.99, 1079.99, 114....
## $ stars_5 <int> 3, 2, 3, 49, 58, 83, 11, 33, 16, 10, 21, 75, 1...
## $ stars_4 <int> 3, 1, 0, 19, 31, 30, 3, 19, 9, 1, 2, 25, 8, 62...
## $ stars_3 <int> 2, 0, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5, 13, ...
## $ stars_2 <int> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3, 0, 8, 7, 2...
## $ stars_1 <int> 0, 0, 0, 9, 36, 40, 1, 9, 2, 0, 15, 3, 1, 16, ...
## $ pos_service <int> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, 44, 57...
## $ neg_service <int> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2, 0, 3, 3, ...
## $ would_recommend <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, 0.8, 0...
## $ bestSeller_rank <fct> 1967, 4806, 12076, 109, 268, 64, NA, 2, NA, 18...
## $ weight <dbl> 25.80, 50.00, 17.40, 5.70, 7.00, 1.60, 7.30, 1...
## $ depth <dbl> 23.94, 35.00, 10.50, 15.00, 12.90, 5.80, 6.70,...
## $ width <dbl> 6.62, 31.75, 8.30, 9.90, 0.30, 4.00, 10.30, 6....
## $ height <dbl> 16.89, 19.00, 10.20, 1.30, 8.90, 1.00, 11.50, ...
## $ profit_margin <dbl> 0.15, 0.25, 0.08, 0.08, 0.09, 0.05, 0.05, 0.05...
## $ volume <int> 12, 8, 12, 196, 232, 332, 44, 132, 64, 40, 84,...
## $ surface <dbl> 902.2201, 2594.0625, 283.1800, 324.7000, 245.7...
Looking at the summary of the attributes
summary(existing)
## type id price stars_5
## Accessories :26 101 : 1 Min. : 3.60 Min. : 0.0
## Printer :12 102 : 1 1st Qu.: 52.66 1st Qu.: 10.0
## Extended Warranty:10 103 : 1 Median : 132.72 Median : 50.0
## Software : 6 104 : 1 Mean : 247.25 Mean : 176.2
## Display : 5 105 : 1 3rd Qu.: 352.49 3rd Qu.: 306.5
## PC : 4 106 : 1 Max. :2249.99 Max. :2801.0
## (Other) :17 (Other):74
## stars_4 stars_3 stars_2 stars_1
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.75 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.: 2.00
## Median : 22.00 Median : 7.00 Median : 3.00 Median : 8.50
## Mean : 40.20 Mean : 14.79 Mean : 13.79 Mean : 37.67
## 3rd Qu.: 33.00 3rd Qu.: 11.25 3rd Qu.: 7.00 3rd Qu.: 15.25
## Max. :431.00 Max. :162.00 Max. :370.00 Max. :1654.00
##
## pos_service neg_service would_recommend bestSeller_rank
## Min. : 0.00 Min. : 0.000 Min. :0.100 7 : 8
## 1st Qu.: 2.00 1st Qu.: 1.000 1st Qu.:0.700 1 : 3
## Median : 5.50 Median : 3.000 Median :0.800 11 : 3
## Mean : 51.75 Mean : 6.225 Mean :0.745 2 : 2
## 3rd Qu.: 42.00 3rd Qu.: 6.250 3rd Qu.:0.900 3 : 2
## Max. :536.00 Max. :112.000 Max. :1.000 (Other):47
## NA's :15
## weight depth width height
## Min. : 0.0100 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.5125 1st Qu.: 4.775 1st Qu.: 1.750 1st Qu.: 0.400
## Median : 2.1000 Median : 7.950 Median : 6.800 Median : 3.950
## Mean : 9.6681 Mean : 14.425 Mean : 7.819 Mean : 6.259
## 3rd Qu.:11.2050 3rd Qu.: 15.025 3rd Qu.:11.275 3rd Qu.:10.300
## Max. :63.0000 Max. :300.000 Max. :31.750 Max. :25.800
##
## profit_margin volume surface
## Min. :0.0500 Min. : 0 Min. : 0.00
## 1st Qu.:0.0500 1st Qu.: 40 1st Qu.: 68.18
## Median :0.1200 Median : 200 Median : 175.10
## Mean :0.1545 Mean : 705 Mean : 1606.03
## 3rd Qu.:0.2000 3rd Qu.: 1226 3rd Qu.: 584.58
## Max. :0.4000 Max. :11204 Max. :90000.50
##
Do all the products have unique IDs?
if (length(unique(existing$id)) == nrow(existing)) {
  print(paste("There are", length(unique(existing$id)), "unique ID values out of", nrow(existing), "observations."))
} else {
  print("CHECK: There is at least 1 repeated ID value in the observations.")
}
## [1] "There are 80 unique ID values out of 80 observations."
Is the best sellers rank field meaningful?
barplot(table(existing$bestSeller_rank), main = "Amount of products\nby rank",
        col = "purple", xlab = "Rank", ylab = "Number of products")
sum(is.na(existing$bestSeller_rank))
## [1] 15
The rank value is not unique for each product, and the meaning of the attribute is not clear. Later on we will decide whether the attribute is worth keeping, using a correlation matrix and a decision tree. Also, there are 15 NA values.
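As a quick exploratory check (a sketch, not part of the original analysis; it assumes dplyr is loaded, which the glimpse() and %>% calls above suggest), we can see which product types concentrate the 15 missing ranks:
# Which product types have a missing best-seller rank? (exploratory sketch)
existing %>%
  filter(is.na(bestSeller_rank)) %>%
  count(type, sort = TRUE)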
Price, Service review and User Recommendation histograms
par(mfrow = c(2,2))
hist(existing$price, col = "purple", xlab = "Price €", main = "Histogram of Price")
hist(existing$would_recommend, col = "purple", xlab = "Would recommend?\n0 No, 1 Yes", main = "Histogram of user recommendation")
hist(existing$pos_service, col = "purple", xlab = "Positive review", main = "Histogram of positive service reviews")
hist(existing$neg_service, col = "purple", xlab = "Negative review", main = "Histogram of negative service reviews")
par(mfrow = c(1,1))

- Most of the products cost up to 500 €; some cost more than 1000 €
- Most of the products are recommended by the users
- Overall, there are more positive reviews for service than negative ones
Studying the dependent variable: VOLUME
boxplot(existing$volume, col = "purple", varwidth = F)

We can see there are a couple of outliers. Let’s check what type of items they are.
existing %>%
filter(volume >= 2500) %>%
select(type, volume) %>%
  knitr::kable()

| type | volume |
|---|---|
| Accessories | 11204 |
| Game Console | 7036 |
An accessory and a game console are selling many units. We will not remove these outliers right now; during the machine learning process to predict volume, we will compare the error metrics with and without the outliers to see which works better.
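For that later comparison, a copy of the data without the two extreme rows could be kept aside (a sketch; this helper object is not shown in the original):
# Copy of the data without the two extreme-volume rows, for later comparison
existing_no_outliers = existing %>% filter(volume < 2500)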
Correlation matrices
Correlation matrix ALL attributes
mat = as.matrix(existing[, -c(1:2, 12)])
corr_mat = cor(mat)
corrplot(corr = corr_mat, diag = F)

The attributes with the strongest correlation with volume are the star reviews and positive service. Keep in mind that the correlation shown here is linear; a quadratic (or otherwise non-linear) relationship would not show up in this matrix.
Also, it looks like there is collinearity between the star review attributes. Investigating this further will help us avoid biasing our models later on.
Correlation matrix Star reviews and Volume
Studying correlation and collinearity
mat_stars = as.matrix(existing[,c(4:8, 18)])
star_corr_mat = cor(mat_stars)
corrplot(corr = star_corr_mat, diag = F, method = "number")

Volume and 5 star reviews have a suspiciously high correlation. A correlation of 1 means that, knowing the number of 5 star reviews of an item, we would know its sales volume exactly, which does not make sense in the real world. It is true that the more positive reviews an item has, the more it will sell, but not to the point where the sales volume of any item is determined entirely by its 5 star reviews. In short, there is something suspicious about the 5 star review values, and we cannot use this data for later predictions.
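To make the point concrete, the raw correlation can be checked directly (a one-line check, not shown in the original; per the matrix above it should return 1, or a value extremely close to it):
# Direct check of the suspicious 5 star / volume relationship
cor(existing$stars_5, existing$volume)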
For that reason, the 5 star reviews attribute will be removed completely, and some feature engineering will be carried out on the other star review attributes, as well as on other fields, just as we did when creating the Surface attribute from Depth, Height and Width.
Feature engineering will allow us to avoid collinearity among attributes and will improve both the correlation study and the machine learning process.
Feature engineering
Let’s create a handful of attributes and see whether they have a better correlation with “volume”.
existing = existing[, -c(2, 4, 13:16)] # Removing id, stars_5 and the raw weight/dimension attributes (surface replaces the dimensions)
existing$reviews_43 = existing$stars_4 + existing$stars_3
existing$reviews_21 = existing$stars_2 + existing$stars_1
existing$reviews_4321 = existing$stars_4 + existing$stars_3 + existing$stars_2 + existing$stars_1
existing$total_service_reviews = existing$pos_service + existing$neg_service
existing$total_positive_reviews = existing$stars_4 + existing$stars_3 + existing$pos_service
existing$total_negative_reviews = existing$stars_2 + existing$stars_1 + existing$neg_service
names(existing)
## [1] "type" "price"
## [3] "stars_4" "stars_3"
## [5] "stars_2" "stars_1"
## [7] "pos_service" "neg_service"
## [9] "would_recommend" "bestSeller_rank"
## [11] "profit_margin" "volume"
## [13] "surface" "reviews_43"
## [15] "reviews_21" "reviews_4321"
## [17] "total_service_reviews" "total_positive_reviews"
## [19] "total_negative_reviews"
- reviews_43 = Sum of the 4 star & 3 star attributes
- reviews_21 = Sum of the 2 star & 1 star attributes
- reviews_4321 = Sum of the 4, 3, 2 & 1 star attributes
- total_service_reviews = Sum of the positive & negative service attributes
- total_positive_reviews = Sum of the 4, 3 star reviews and positive service
- total_negative_reviews = Sum of the 2, 1 star reviews and negative service
Correlation matrix with newly created attributes
mat_stars = as.matrix(existing[, -c(1, 10)])
star_corr_mat = cor(mat_stars)
corrplot(corr = star_corr_mat, diag = F, method = "number", number.cex = 0.5)

Let’s study this correlation matrix:
There are now many attributes correlated to some degree with volume. We need to filter out which ones are the best to work with for volume prediction.
The attributes most correlated with volume are 4 star, 3 star, positive service, 4+3 star, total_service and total_positive. What we have here is a collinearity issue that needs fixing.
4 star is highly correlated with volume, but also with 3 star, 4+3 star and total positive reviews. This means these variables “explain” the same information, and we do not want to keep all of them, to avoid biasing our models later. We want to keep as much of that information as possible, so we will keep the attribute that is most correlated with volume while not being highly correlated with other attributes at the same time. In this case, that means we will stick with 4 star and will not use 3 star, 4+3 star or total positive reviews.
It must be said, though, that a correlation matrix only shows linear correlation between attributes: if there is some kind of quadratic relationship between volume and another attribute, it will not appear in the correlation matrix (the coefficient will be 0 or close to 0). To study relationships other than linear ones, we can use a Random Forest.
Let’s build a Random Forest in order to see how the attributes distribute the information for the variable volume.
Attributes used are price + stars_4 + pos_service + neg_service + would_recommend + reviews_21
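The fitting call for RF_some is not shown in the original. Since the importance table and the plot below rely on rpart features ($variable.importance and rpart.plot()), a sketch consistent with them might look like this; the exact call and hyperparameters are assumptions:
library(rpart)        # tree fitting; rpart objects expose $variable.importance
library(rpart.plot)   # rpart.plot(), used below

# Sketch of how RF_some may have been fitted (not shown in the original)
RF_some = rpart(volume ~ price + stars_4 + pos_service + neg_service +
                  would_recommend + reviews_21,
                data = existing)
summary_RF_some = RF_some  # $variable.importance is read from this object below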
summary_RF_some$variable.importance
## stars_4 reviews_21 neg_service pos_service
## 78458578 59323385 27749245 16270987
rpart.plot(RF_some, type = 5, shadow.col = "black")

The variable importance above ranks 4 star, 2+1 star reviews, negative service and positive service as the most important attributes. This only means that each of these attributes is used in many splits of the trees.
What really tells us about importance is the plot, which shows which attributes are the most important ones for splitting the data, and those are 4 star and positive service.
Let’s compare this random forest (built with some attributes) with a random forest using all the original attributes plus the ones created in the feature engineering section.
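As before, the fitting call for RF_all is not shown in the original; a short sketch under the same assumptions (the exact formula, including the exclusion of bestSeller_rank, is a guess based on the importance table below):
# Sketch of the all-attributes fit (not shown in the original)
RF_all = rpart(volume ~ . - bestSeller_rank, data = existing)
summary_RF_all = RF_all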
summary_RF_all$variable.importance
## reviews_43 stars_4 reviews_4321
## 73672994 73672994 60280502
## reviews_21 stars_2 stars_3
## 52623567 52623567 52623567
## pos_service total_positive_reviews total_service_reviews
## 16270987 14356753 14356753
## stars_1 total_negative_reviews
## 8614052 8614052
rpart.plot(RF_all, type = 5, shadow.col = "black")

We can see that the variable importance ranks the same attributes as the previous Random Forest, and the plot shows the same attributes being used to split the data.
Based on the correlation matrix and the Random Forest approach, our final decision is to select 4 star and positive service as the attributes for predicting volume.
existing_2attributes = existing %>%
  select(stars_4, pos_service, volume)

Machine Learning process. Predicting Volume sales.
We will be using LM, RF, KNN and SVM models, then deciding which one works best and analysing the error later on.
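The original does not show any library() calls; judging by the functions used below, the session presumably loads something like the following (a sketch of the assumed dependencies):
library(randomForest)  # randomForest()
library(caret)         # knnreg()
library(e1071)         # svm()
library(Metrics)       # rmse(), mae(), mdae()
library(knitr)         # kable()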
Partitioning data
set.seed(123)
sample = sample.int(n = nrow(existing_2attributes),
                    size = floor(0.80 * nrow(existing_2attributes)),
                    replace = F)
train = existing_2attributes[sample, ]   # 64 observations (80%)
test = existing_2attributes[-sample, ]   # 16 observations (20%)

Models
Linear Regression
lm_fit = lm(formula = volume~., data = train)
summary(lm_fit)
##
## Call:
## lm(formula = volume ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1479.13 -80.99 5.07 41.03 3116.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.9281 85.1748 -0.210 0.83399
## stars_4 12.8608 1.1887 10.819 8.05e-16 ***
## pos_service 2.3365 0.7266 3.216 0.00208 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 548.8 on 61 degrees of freedom
## Multiple R-squared: 0.7122, Adjusted R-squared: 0.7028
## F-statistic: 75.49 on 2 and 61 DF, p-value: < 2.2e-16
Both independent variables are statistically significant, which means that changes in them affect the dependent variable. The next step is predicting volume.
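Before predicting on the test set, a quick sanity check of the coefficients with a hypothetical item (a sketch, not part of the original): an item with 10 four-star reviews and 5 positive service reviews gets a predicted volume of roughly -17.93 + 12.86 * 10 + 2.34 * 5 ≈ 122 units.
# Hypothetical item: 10 four-star reviews, 5 positive service reviews
predict(lm_fit, newdata = data.frame(stars_4 = 10L, pos_service = 5L))  # ~122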
lm_pred = predict(object = lm_fit, newdata = test)
lm_pred
## 1 11 12 19 25 26
## 25.32734 12.46651 324.62093 143.41126 397.11291 1546.33857
## 28 29 50 53 55 61
## 96.64093 -15.59163 6777.44571 657.70234 488.29680 418.16159
## 62 71 79 80
## -17.92812 858.92427 393.57776 1168.72192
Those are the predicted volume sales for the test products using the linear model. Note that two of the predictions are negative, which is impossible for a sales volume. Let’s build the other models; then we will compare the error metrics in a matrix and proceed to the next step.
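If needed, a simple post-processing step would be to clamp the linear model’s predictions at zero (an optional fix, not applied in the original analysis):
# Sales volumes cannot be negative; clamp predictions at zero
lm_pred_clamped = pmax(lm_pred, 0)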
Random Forest
rf_fit = randomForest(volume~., data = train)
rf_pred = predict(object = rf_fit, newdata = test)

K-nearest neighbours
knn_fit = caret::knnreg(formula = volume~., data = train, k = 3)
knn_pred = predict(object = knn_fit, newdata = test)

Support Vector Machine
svm_fit = svm(formula = volume~., data = train)
svm_pred = predict(object = svm_fit, newdata = test)

Error metrics comparison
# One row per model, one column per error metric
error.matrix = matrix(nrow = 4, ncol = 3,
dimnames = list(c("LM", "RF", "KNN", "SVM"),
c("RMSE", "MAE", "MDAE")))
error.matrix[1,1] = rmse(actual = test$volume, predicted = lm_pred)
error.matrix[2,1] = rmse(actual = test$volume, predicted = rf_pred)
error.matrix[3,1] = rmse(actual = test$volume, predicted = knn_pred)
error.matrix[4,1] = rmse(actual = test$volume, predicted = svm_pred)
error.matrix[1,2] = mae(actual = test$volume, predicted = lm_pred)
error.matrix[2,2] = mae(actual = test$volume, predicted = rf_pred)
error.matrix[3,2] = mae(actual = test$volume, predicted = knn_pred)
error.matrix[4,2] = mae(actual = test$volume, predicted = svm_pred)
error.matrix[1,3] = mdae(actual = test$volume, predicted = lm_pred)
error.matrix[2,3] = mdae(actual = test$volume, predicted = rf_pred)
error.matrix[3,3] = mdae(actual = test$volume, predicted = knn_pred)
error.matrix[4,3] = mdae(actual = test$volume, predicted = svm_pred)

kable(error.matrix)

|     | RMSE | MAE | MDAE |
|---|---|---|---|
| LM | 1197.945 | 549.3654 | 137.63725 |
| RF | 1841.609 | 633.0917 | 48.96955 |
| KNN | 2421.314 | 795.2125 | 48.66667 |
| SVM | 2516.003 | 841.4344 | 97.82619 |
In the end, none of the models is a particularly strong predictor (then again, we only have 80 observations). Still, for business purposes we are interested in the items for which the prediction is accurate.
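One way to find those items (a quick sketch, not in the original) is to look at the per-item absolute error of the model with the best MAE, the linear model:
# Per-item absolute error of the linear model on the test set
abs_err = abs(test$volume - lm_pred)
head(sort(abs_err))  # the items whose volume is predicted most accurately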
new_2attributes = new %>%
select(X.4.Star.Reviews., X.Positive.Service.Review.) %>%
transmute(stars_4 = as.integer(X.4.Star.Reviews.),
            pos_service = as.integer(X.Positive.Service.Review.))

— TO BE CONTINUED —