Task overview

BUSINESS QUESTION: Which 5 products will be the most profitable for the company?

What data do we have?

New product attributes and existing product attributes.

  • Predicting sales of four different product types: PCs, Laptops, Netbooks and Smartphones
  • Assessing the impact that service reviews and customer reviews have on the sales of different product types

Index

  1. Data cleaning

  2. Data exploration

  3. Pre-process: feature selection (correlation matrix) & feature engineering

  4. Modelling: linear regression, KNN, SVM and random forest

  5. Error analysis

Exploring the data

## Observations: 80
## Variables: 18
## $ ProductType           <chr> "PC", "PC", "PC", "Laptop", "Laptop", "A...
## $ ProductNum            <dbl> 101, 102, 103, 104, 105, 106, 107, 108, ...
## $ Price                 <dbl> 949.00, 2249.99, 399.00, 409.99, 1079.99...
## $ x5StarReviews         <dbl> 3, 2, 3, 49, 58, 83, 11, 33, 16, 10, 21,...
## $ x4StarReviews         <dbl> 3, 1, 0, 19, 31, 30, 3, 19, 9, 1, 2, 25,...
## $ x3StarReviews         <dbl> 2, 0, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5...
## $ x2StarReviews         <dbl> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3, 0, 8...
## $ x1StarReviews         <dbl> 0, 0, 0, 9, 36, 40, 1, 9, 2, 0, 15, 3, 1...
## $ PositiveServiceReview <dbl> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, ...
## $ NegativeServiceReview <dbl> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2, 0, ...
## $ Recommendproduct      <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, ...
## $ BestSellersRank       <dbl> 1967, 4806, 12076, 109, 268, 64, NA, 2, ...
## $ ShippingWeight        <dbl> 25.80, 50.00, 17.40, 5.70, 7.00, 1.60, 7...
## $ ProductDepth          <dbl> 23.94, 35.00, 10.50, 15.00, 12.90, 5.80,...
## $ ProductWidth          <dbl> 6.62, 31.75, 8.30, 9.90, 0.30, 4.00, 10....
## $ ProductHeight         <dbl> 16.89, 19.00, 10.20, 1.30, 8.90, 1.00, 1...
## $ ProfitMargin          <dbl> 0.15, 0.25, 0.08, 0.08, 0.09, 0.05, 0.05...
## $ Volume                <dbl> 12, 8, 12, 196, 232, 332, 44, 132, 64, 4...

Data cleaning

Transformation to factor:

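fact_var <- c("ProductType","ProductNum")
# lapply() converts each column separately; apply() would coerce the columns
# to a character matrix, so they would not end up as factors
ex_prod[fact_var] <- lapply(ex_prod[fact_var], as.factor)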
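
As a quick check of this conversion step, the sketch below compares two ways of turning columns into factors on a small made-up data frame. The names toy, via_apply and via_lapply are illustrative and are not objects from the analysis. apply() first coerces the selected columns to a matrix, so the result is stored as character, while lapply() handles each column on its own and keeps the factor class.

```r
# Toy data frame (made-up values) with one text and one numeric id column
toy <- data.frame(ProductType = c("PC", "Laptop", "PC"),
                  ProductNum = c(101, 102, 103),
                  Price = c(949, 409.99, 399),
                  stringsAsFactors = FALSE)
fact_var <- c("ProductType", "ProductNum")

# Column-wise apply(): the factors are flattened into a character matrix
via_apply <- toy
via_apply[, fact_var] <- apply(via_apply[, fact_var], 2, as.factor)

# lapply(): each column is converted and stored as a factor
via_lapply <- toy
via_lapply[fact_var] <- lapply(via_lapply[fact_var], as.factor)

# Compare the resulting column classes
class_apply <- sapply(via_apply[fact_var], class)
class_lapply <- sapply(via_lapply[fact_var], class)
print(class_apply)
print(class_lapply)
```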

Giving names to the rows:

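# column_to_rownames() moves ProductNum into the row names and drops the column,
# so no separate removal of ProductNum is needed afterwards
ex_prod <- tibble::column_to_rownames(.data = ex_prod,
                                      var = "ProductNum")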

Data cleaning: detecting NAs in BestSellersRank with VIM

## 
##  Variables sorted by number of missings: 
##               Variable  Count
##        BestSellersRank 0.1875
##            ProductType 0.0000
##                  Price 0.0000
##          x5StarReviews 0.0000
##          x4StarReviews 0.0000
##          x3StarReviews 0.0000
##          x2StarReviews 0.0000
##          x1StarReviews 0.0000
##  PositiveServiceReview 0.0000
##  NegativeServiceReview 0.0000
##       Recommendproduct 0.0000
##         ShippingWeight 0.0000
##           ProductDepth 0.0000
##           ProductWidth 0.0000
##          ProductHeight 0.0000
##           ProfitMargin 0.0000
##                 Volume 0.0000
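
The proportions in this output can be cross-checked without VIM. The sketch below is a base R approximation on a made-up data frame: with 80 observations, a share of 0.1875 corresponds to 15 missing values. The names toy and na_share are illustrative only.

```r
# Toy data: 80 rows, 15 of them with a missing BestSellersRank (made-up values)
toy <- data.frame(BestSellersRank = c(rep(NA, 15), seq_len(65)),
                  Price = rep(9.99, 80))

# Share of missing values per column, largest first
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(na_share)
```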

1st data exploration: Blackwell business

1st data exploration: Volume distribution

1st modelling: linear regression

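# train and test
library(caret)   # provides createDataPartition()
set.seed(123)    # make the random 80/20 split reproducible
train_id <- createDataPartition(y = ex_prod$Volume, p = 0.80, list = FALSE)
train <- ex_prod[train_id, ]
test <- ex_prod[-train_id, ]

# create linear regression model using every remaining variable as a predictor
mod_lm <- lm(formula = Volume ~ ., data = train)

metric   train    test
RMSE     0        0
R^2      100 %    100 %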
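
For reference, the two metrics in these tables can be computed as in the minimal sketch below. It assumes that R^2 is the squared correlation between observed and predicted values, which is what caret::postResample() reports; the helper names rmse and r_squared and the toy vectors are hypothetical.

```r
# Root mean squared error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# R^2 taken as the squared correlation (assumption, see the text above)
r_squared <- function(obs, pred) cor(obs, pred)^2

# Made-up observed and predicted volumes
obs <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 39)

rmse_value <- rmse(obs, pred)
r2_value <- r_squared(obs, pred)

# A model that reproduces the target exactly gives RMSE 0 and R^2 1
rmse_perfect <- rmse(obs, obs)
r2_perfect <- r_squared(obs, obs)
```

A table with RMSE 0 and R^2 100 % on both sets therefore means the predictions match the observed volumes exactly.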

Main predictors:

  1. 5 stars
  2. Product type: Game console

2nd pre-process: feature selection

2nd modelling: linear regression

metric   train    test
RMSE     0        0
R^2      100 %    100 %

Main predictors:

  1. 5 stars
  2. Product type: PC
  3. Price

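The model is overfitted again: it still reproduces Volume perfectly on both the train and the test set. In the rows printed under "Exploring the data", Volume is exactly 4 times x5StarReviews, so the 5-star count alone determines the target.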
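
A possible way to confirm where the perfect fit comes from is to compare Volume with x5StarReviews directly. The sketch below uses only the first nine values of each column as printed under "Exploring the data", so it is a spot check and not a test on the full data set; the variable names are illustrative.

```r
# First nine values of x5StarReviews and Volume as printed in the data overview
x5 <- c(3, 2, 3, 49, 58, 83, 11, 33, 16)
vol <- c(12, 8, 12, 196, 232, 332, 44, 132, 64)

# Is Volume a fixed multiple of the 5-star count in these rows?
ratio_is_constant <- all(vol == 4 * x5)
corr_x5_vol <- cor(x5, vol)

print(ratio_is_constant)
print(corr_x5_vol)
```

If the same relation holds for every row, any model that includes x5StarReviews can reproduce Volume exactly, which would explain an RMSE of 0.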

3rd pre-process: outliers in stars features

3rd pre-process: feature engineering

3rd pre-process: corr. matrix with total stars

3rd pre-process: corr. matrix with x4 and x2
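
As a rough illustration of correlation-based feature selection, the sketch below builds a correlation matrix from the first nine rows printed under "Exploring the data" and lists the pairs above a cut-off. The 0.85 threshold and the object names are assumptions made for this example, and nine rows are far too few to reproduce the matrices shown on the slides.

```r
# First nine values of each column as printed in the data overview
stars <- data.frame(
  x4StarReviews = c(3, 1, 0, 19, 31, 30, 3, 19, 9),
  x3StarReviews = c(2, 0, 0, 8, 11, 10, 0, 12, 2),
  x2StarReviews = c(0, 0, 0, 3, 7, 9, 0, 5, 0),
  x1StarReviews = c(0, 0, 0, 9, 36, 40, 1, 9, 2),
  Volume        = c(12, 8, 12, 196, 232, 332, 44, 132, 64)
)

# Pearson correlation between every pair of columns
corr_mat <- cor(stars)

# Pairs whose absolute correlation exceeds a hypothetical cut-off
threshold <- 0.85
high <- which(abs(corr_mat) > threshold & upper.tri(corr_mat), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(corr_mat)[high[, "row"]],
                         var2 = colnames(corr_mat)[high[, "col"]],
                         corr = corr_mat[high])

print(round(corr_mat, 2))
print(high_pairs)
```

Predictors that are strongly correlated with each other carry overlapping information, so keeping one of each such pair is a common way to reduce collinearity in a linear model.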

3rd modelling: linear regression

metric   train     test
RMSE     307.07    276.19
R^2      71.77 %   80.47 %

My model is no longer overfitted, but its performance is still low. Let’s check where it is failing!
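
One way to see where a model fails is to aggregate the absolute errors by product type. The sketch below assumes a data frame with actual and predicted volumes; all values and the names res and err_by_type are made up for illustration and do not come from the analysis.

```r
# Made-up actual and predicted volumes for a handful of products
res <- data.frame(
  ProductType = c("PC", "PC", "Laptop", "Laptop", "Netbook", "Smartphone"),
  actual      = c(15, 25, 200, 240, 100, 300),
  predicted   = c(23, 27, 154, 258, 90, 380)
)

# Absolute error per product
res$abs_error <- abs(res$actual - res$predicted)

# Mean absolute error per product type, worst category first
err_by_type <- aggregate(abs_error ~ ProductType, data = res, FUN = mean)
err_by_type <- err_by_type[order(-err_by_type$abs_error), ]
print(err_by_type)
```

Sorting by error makes it easy to check whether the categories of interest (PC, Laptop, Netbook, Smartphone) are among the poorly predicted ones.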

3rd error check: error visualization lm

4th exploration: recommendation variable

4th pre-process: repeated observations

product_num  ProductType       Pos_Ser  Neg_Ser  Recomend  Vol
132          ExtendedWarranty  0        3        0.4       0
133          ExtendedWarranty  0        1        0.6       20
134          ExtendedWarranty  280      8        0.9       1232
135          ExtendedWarranty  280      8        0.9       1232
136          ExtendedWarranty  280      8        0.9       1232
137          ExtendedWarranty  280      8        0.9       1232
138          ExtendedWarranty  280      8        0.9       1232
139          ExtendedWarranty  280      8        0.9       1232
140          ExtendedWarranty  280      8        0.9       1232
141          ExtendedWarranty  280      8        0.9       1232
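
The repeated rows above can be collapsed with duplicated(). This is a minimal sketch on the columns shown in the table only; the real rows may differ in columns that are not printed here, and the names warranty and warranty_unique are illustrative.

```r
# The ten ExtendedWarranty rows from the table above
warranty <- data.frame(
  product_num = 132:141,
  ProductType = "ExtendedWarranty",
  Pos_Ser     = c(0, 0, rep(280, 8)),
  Neg_Ser     = c(3, 1, rep(8, 8)),
  Recomend    = c(0.4, 0.6, rep(0.9, 8)),
  Vol         = c(0, 20, rep(1232, 8))
)

# Compare rows on every column except the product number
compare_cols <- setdiff(names(warranty), "product_num")
is_repeat <- duplicated(warranty[, compare_cols])

# Keep the first occurrence of each distinct observation
warranty_unique <- warranty[!is_repeat, ]
print(warranty_unique)
```

Keeping a single copy prevents the same observation from being counted eight times when the model is trained and evaluated.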

4th feature engineering: positive and negative service reviews

4th modelling: linear regression

metric   train     test
RMSE     333.82    241.01
R^2      65.46 %   69.35 %

The test RMSE has improved a little (from 276.19 to 241.01), although R^2 has dropped on both sets. Let’s see how the model performs on the categories we are interested in.

4th error check: error visualization lm

5th modelling: using KNN with caret

# defining variables to create the model
rel_var <- c("x4","x2","Pos_Ser","Neg_Ser","Recomend","Vol","PC","Laptop",
             "Netb","Smart_Ph")

# cross validation: 10 folds, repeated 3 times
ctrl <- caret::trainControl(method = "repeatedcv",
                            number = 10,
                            repeats = 3)

# modelling: predictors are centered and scaled before distances are computed
set.seed(123)   # same seed as the later models, for reproducible folds
mod_5knn_caret <- caret::train(Vol ~ .,
                               method = "knn",
                               data = train[, rel_var],
                               trControl = ctrl,
                               preProcess = c("center", "scale"))

metric   train     test
RMSE     248.68    153.56
R^2      84.6 %    90.4 %
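
To illustrate why the predictors are centered and scaled before KNN, here is a minimal base R sketch of k-nearest-neighbours regression. It is not caret's implementation; the function knn_predict and the toy data are hypothetical.

```r
# Toy k-nearest-neighbours regression (illustration only)
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Standardise with the training mean and sd, as "center" and "scale" do
  mu <- colMeans(train_x)
  sdev <- apply(train_x, 2, sd)
  tr <- scale(train_x, center = mu, scale = sdev)
  nw <- scale(new_x, center = mu, scale = sdev)

  apply(nw, 1, function(row) {
    # Euclidean distance from the new point to every training row
    d <- sqrt(rowSums(sweep(tr, 2, row)^2))
    # Prediction: mean target of the k closest training rows
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Two predictors on very different scales (made-up values)
train_x <- matrix(c(1, 2, 3, 10, 11, 12,
                    100, 200, 300, 1000, 1100, 1200), ncol = 2)
train_y <- c(1, 2, 3, 10, 11, 12)
new_x <- matrix(c(2, 200), ncol = 2)

pred <- knn_predict(train_x, train_y, new_x, k = 3)
print(pred)
```

Without standardisation, the second column would dominate the distance simply because its values are larger, which is the reason for the preProcess argument in the call above.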

5th error check: error visualization knn

6th modelling: using Random Forest

set.seed(123)
mod_6rf <- caret::train(Vol ~ .,
                        method = "rf",
                        data = train[, rel_var],
                        trControl = ctrl)

metric   train     test
RMSE     102.03    84.67
R^2      97 %      98.03 %

6th error check: error visualization rf

7th modelling: using Support Vector Machine

set.seed(123)
mod_7svm <- caret::train(Vol ~ .,
                         method = "svmLinear",
                         data = train[, rel_var],
                         trControl = ctrl)

metric   train     test
RMSE     373.13    334.86
R^2      61.16 %   44.33 %
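
The test-set metrics reported for the 4th linear regression, KNN, random forest and SVM can be put side by side to choose a model. The sketch below only re-enters the numbers from the tables in this document; the object names are illustrative.

```r
# Test-set metrics copied from the tables in this document
results <- data.frame(
  model     = c("lm (4th)", "knn", "rf", "svmLinear"),
  rmse_test = c(241.01, 153.56, 84.67, 334.86),
  r2_test   = c(69.35, 90.4, 98.03, 44.33)   # in percent
)

# Rank the models by test RMSE, lowest error first
results <- results[order(results$rmse_test), ]
best_model <- results$model[1]

print(results)
print(best_model)
```

On these numbers the random forest has both the lowest test RMSE and the highest test R^2.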

7th error check: error visualization SVM

Model application and results
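
A possible way to answer the business question once volumes have been predicted for the new products is sketched below. It assumes that profit can be approximated as predicted volume times Price times ProfitMargin; this formula, the data frame new_prod and all of its values are assumptions made for illustration, not results from the analysis.

```r
# Made-up new products with hypothetical predicted volumes
new_prod <- data.frame(
  ProductNum   = c(201, 202, 203, 204, 205, 206),
  Price        = c(500, 1000, 300, 200, 800, 150),
  ProfitMargin = c(0.10, 0.20, 0.10, 0.25, 0.05, 0.20),
  pred_volume  = c(100, 50, 400, 300, 20, 10)
)

# Assumed profit estimate: units sold * price * margin
new_prod$pred_profit <- with(new_prod, pred_volume * Price * ProfitMargin)

# Five products with the highest estimated profit
top5 <- head(new_prod[order(-new_prod$pred_profit), ], 5)
print(top5[, c("ProductNum", "pred_profit")])
```

In practice pred_volume would come from predict() applied to the chosen model and the new product attributes.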