Predicting profitability in R

Task overview

BUSINESS QUESTION: Which five products will be the most profitable for the company?

What data do we have?

New product attributes and existing product attributes.

Predicting sales of four different product types: PCs, Laptops, Netbooks and Smartphones
Assessing the impact that service reviews and customer reviews have on the sales of different product types

Index

Data cleaning
Data exploration
Pre-process: feature selection (correlation matrix) & feature engineering
Modeling: linear regression, KNN, SVM and random forest
Error analysis

Exploring the data

## Observations: 80
## Variables: 18
## $ ProductType           <chr> "PC", "PC", "PC", "Laptop", "Laptop", "A...
## $ ProductNum            <dbl> 101, 102, 103, 104, 105, 106, 107, 108, ...
## $ Price                 <dbl> 949.00, 2249.99, 399.00, 409.99, 1079.99...
## $ x5StarReviews         <dbl> 3, 2, 3, 49, 58, 83, 11, 33, 16, 10, 21,...
## $ x4StarReviews         <dbl> 3, 1, 0, 19, 31, 30, 3, 19, 9, 1, 2, 25,...
## $ x3StarReviews         <dbl> 2, 0, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5...
## $ x2StarReviews         <dbl> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3, 0, 8...
## $ x1StarReviews         <dbl> 0, 0, 0, 9, 36, 40, 1, 9, 2, 0, 15, 3, 1...
## $ PositiveServiceReview <dbl> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, ...
## $ NegativeServiceReview <dbl> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2, 0, ...
## $ Recommendproduct      <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, ...
## $ BestSellersRank       <dbl> 1967, 4806, 12076, 109, 268, 64, NA, 2, ...
## $ ShippingWeight        <dbl> 25.80, 50.00, 17.40, 5.70, 7.00, 1.60, 7...
## $ ProductDepth          <dbl> 23.94, 35.00, 10.50, 15.00, 12.90, 5.80,...
## $ ProductWidth          <dbl> 6.62, 31.75, 8.30, 9.90, 0.30, 4.00, 10....
## $ ProductHeight         <dbl> 16.89, 19.00, 10.20, 1.30, 8.90, 1.00, 1...
## $ ProfitMargin          <dbl> 0.15, 0.25, 0.08, 0.08, 0.09, 0.05, 0.05...
## $ Volume                <dbl> 12, 8, 12, 196, 232, 332, 44, 132, 64, 4...

Data cleaning

Transformation to factor:

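# columns to convert to factor
fact_var <- c("ProductType","ProductNum")
# lapply() keeps each column a factor; apply() would return a character matrix
ex_prod[fact_var] <- lapply(ex_prod[fact_var], as.factor)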
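
A note on converting several columns at once, as a minimal sketch on a toy data frame (the values are illustrative, not taken from ex_prod). apply() first coerces the data frame to a matrix, so the columns come back as character, whereas lapply() works column by column and keeps the factor class.

```r
# Toy data frame; the values are illustrative only
toy <- data.frame(ProductType = c("PC", "Laptop", "PC"),
                  ProductNum = c(101, 102, 103),
                  stringsAsFactors = FALSE)
cols <- c("ProductType", "ProductNum")

# apply() goes through a character matrix, so the columns end up as character
via_apply <- toy
via_apply[, cols] <- apply(via_apply[, cols], 2, as.factor)

# lapply() converts each column separately and keeps the factor class
via_lapply <- toy
via_lapply[cols] <- lapply(via_lapply[cols], as.factor)

print(sapply(via_apply, class))
print(sapply(via_lapply, class))
```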

Giving names to the rows:

# column_to_rownames() moves ProductNum into the row names and drops the column
ex_prod <- tibble::column_to_rownames(.data = ex_prod,
                                      var = "ProductNum")

Data cleaning: detecting NAs in BestSellersRank with VIM

## 
##  Variables sorted by number of missings: 
##               Variable  Count
##        BestSellersRank 0.1875
##            ProductType 0.0000
##                  Price 0.0000
##          x5StarReviews 0.0000
##          x4StarReviews 0.0000
##          x3StarReviews 0.0000
##          x2StarReviews 0.0000
##          x1StarReviews 0.0000
##  PositiveServiceReview 0.0000
##  NegativeServiceReview 0.0000
##       Recommendproduct 0.0000
##         ShippingWeight 0.0000
##           ProductDepth 0.0000
##           ProductWidth 0.0000
##          ProductHeight 0.0000
##           ProfitMargin 0.0000
##                 Volume 0.0000
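
The "Count" column in the output above is a proportion: 0.1875 of 80 observations is 15 rows with a missing BestSellersRank. The same figure can be obtained in base R without VIM. The sketch below uses a small toy data frame (illustrative values, not the real ex_prod), and the object names are hypothetical.

```r
# Toy data frame standing in for ex_prod; the values are illustrative only
toy <- data.frame(
  Price = c(949, 2249.99, 399, 409.99),
  BestSellersRank = c(1967, NA, 12076, NA),
  Volume = c(12, 8, 12, 196)
)

# Share of missing values per column, sorted from most to least missing
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(na_share)

# Names of the columns that contain at least one NA
cols_with_na <- names(na_share)[na_share > 0]
print(cols_with_na)
```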

1st data exploration: Blackwell business

1st data exploration: Volume distribution

1st modeling: linear regression

# train and test split (80/20); the seed makes the partition reproducible
set.seed(123)
train_id <- caret::createDataPartition(y = ex_prod$Volume, p = 0.80, list = FALSE)
train <- ex_prod[train_id,]
test <- ex_prod[-train_id,]

# create linear regression model
mod_lm <- lm(formula = Volume ~ ., data = train)

metric	train	test
RMSE	0	0
R^2	100 %	100 %

Main predictors:

5 stars
Product type: Game console
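
The train and test metrics reported in this deck can be computed along the following lines. This is a sketch, not the original code: reg_metrics is a hypothetical helper, and R^2 is taken as the squared correlation between predictions and observations, which is an assumption about the definition used. The toy rows are the first eight x5StarReviews and Volume values printed in the data overview; on these rows the linear fit happens to be exact, which reproduces the RMSE of 0 and R^2 of 100 % pattern.

```r
# Hypothetical helper: RMSE and R^2 (squared correlation) for predictions
reg_metrics <- function(pred, obs) {
  c(RMSE = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2)
}

# First eight rows of x5StarReviews and Volume from the data overview
toy <- data.frame(
  x5StarReviews = c(3, 2, 3, 49, 58, 83, 11, 33),
  Volume = c(12, 8, 12, 196, 232, 332, 44, 132)
)
train_toy <- toy[1:6, ]
test_toy <- toy[7:8, ]

# Fit on the training rows, then score both subsets
fit <- lm(Volume ~ x5StarReviews, data = train_toy)
m_train <- reg_metrics(predict(fit, train_toy), train_toy$Volume)
m_test <- reg_metrics(predict(fit, test_toy), test_toy$Volume)
print(round(rbind(train = m_train, test = m_test), 4))
```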

2nd pre-process: feature selection

2nd modeling: linear regression

metric	train	test
RMSE	0	0
R^2	100 %	100 %

Main predictors:

5 stars
Product type: PC
Price

The model is overfitted again: as in the first model, RMSE is 0 and R^2 is 100 % on both the train and the test set.

3rd pre-process: outliers in stars features

3rd pre-process: feature engineering
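
One simple engineered feature, suggested by the "total stars" mentioned in this deck, is the total number of star reviews per product. The sketch below is an assumption about how such a feature could be built: the column name total_stars and the plain sum are hypothetical, since the original code is not shown. The rows are the first eight printed in the data overview.

```r
# First eight rows of the star review columns from the data overview
stars <- data.frame(
  x5StarReviews = c(3, 2, 3, 49, 58, 83, 11, 33),
  x4StarReviews = c(3, 1, 0, 19, 31, 30, 3, 19),
  x3StarReviews = c(2, 0, 0, 8, 11, 10, 0, 12),
  x2StarReviews = c(0, 0, 0, 3, 7, 9, 0, 5),
  x1StarReviews = c(0, 0, 0, 9, 36, 40, 1, 9)
)

# Select the five star columns by name and sum them row by row
star_cols <- grep("^x[1-5]StarReviews$", names(stars), value = TRUE)
stars$total_stars <- rowSums(stars[, star_cols])  # hypothetical column name
print(stars$total_stars)
```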

3rd pre-process: corr. matrix with total stars

3rd pre-process: corr. matrix with x4 and x2
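
A correlation matrix for feature selection can be sketched as follows. The rows are the first eight printed in the data overview, and the 0.99 cut-off is an assumption chosen for illustration, not a rule taken from the original analysis. A predictor that is almost perfectly correlated with the target is worth inspecting before it goes into a model.

```r
# First eight rows of three review columns and the target
reviews <- data.frame(
  x5StarReviews = c(3, 2, 3, 49, 58, 83, 11, 33),
  x4StarReviews = c(3, 1, 0, 19, 31, 30, 3, 19),
  x2StarReviews = c(0, 0, 0, 3, 7, 9, 0, 5),
  Volume = c(12, 8, 12, 196, 232, 332, 44, 132)
)

# Pearson correlation matrix of all columns
corr <- cor(reviews)
print(round(corr, 2))

# Predictors whose correlation with Volume exceeds the assumed cut-off
vol_corr <- corr["Volume", setdiff(colnames(corr), "Volume")]
near_perfect <- names(vol_corr)[abs(vol_corr) > 0.99]
print(near_perfect)
```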

3rd modeling: linear regression

metric	train	test
RMSE	307.07	276.19
R^2	71.77 %	80.47 %

The model is no longer overfitted, but its performance is poor. Let’s check where it is failing!

3rd error check: errors visualization lm

4th exploration: recommendation variable

4th pre-process: repeated observations

product_num	ProductType	Pos_Ser	Neg_Ser	Recomend	Vol
132	ExtendedWarranty	0	3	0.4	0
133	ExtendedWarranty	0	1	0.6	20
134	ExtendedWarranty	280	8	0.9	1232
135	ExtendedWarranty	280	8	0.9	1232
136	ExtendedWarranty	280	8	0.9	1232
137	ExtendedWarranty	280	8	0.9	1232
138	ExtendedWarranty	280	8	0.9	1232
139	ExtendedWarranty	280	8	0.9	1232
140	ExtendedWarranty	280	8	0.9	1232
141	ExtendedWarranty	280	8	0.9	1232
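
Repeated observations like the ones above can be flagged with duplicated(), ignoring the product number. This is a sketch on the columns shown in the table only; keeping the first occurrence is an assumption, since the deck does not state how the repeats were handled.

```r
# The ten ExtendedWarranty rows from the table above
warranty <- data.frame(
  product_num = 132:141,
  ProductType = "ExtendedWarranty",
  Pos_Ser = c(0, 0, rep(280, 8)),
  Neg_Ser = c(3, 1, rep(8, 8)),
  Recomend = c(0.4, 0.6, rep(0.9, 8)),
  Vol = c(0, 20, rep(1232, 8))
)

# Compare rows on every column except the identifier
compare_cols <- setdiff(names(warranty), "product_num")
is_repeat <- duplicated(warranty[, compare_cols])

# Keep the first occurrence of each repeated observation (assumed policy)
deduped <- warranty[!is_repeat, ]
print(deduped)
```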

4th feature engineering: pos. and neg. service

4th modeling: linear regression

metric	train	test
RMSE	333.82	241.01
R^2	65.46 %	69.35 %

Test RMSE has improved slightly (276.19 to 241.01), although R^2 is lower on both sets. Let’s see how the model performs on the categories we are interested in.

4th error check: error visualization lm
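
Errors per category can be summarised by averaging the absolute error within each product type. The sketch below uses made-up observed and predicted values purely for illustration; none of the numbers come from the actual models, and the object names are hypothetical.

```r
# Illustrative observed and predicted volumes; not real model output
results <- data.frame(
  ProductType = c("PC", "PC", "Laptop", "Laptop", "Netbook", "Smartphone"),
  observed = c(12, 8, 196, 232, 80, 300),
  predicted = c(20, 4, 180, 250, 100, 260)
)

# Absolute error per product, then mean absolute error per product type
results$abs_error <- abs(results$predicted - results$observed)
err_by_type <- aggregate(abs_error ~ ProductType, data = results, FUN = mean)
print(err_by_type)
```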

5th modeling: using KNN with caret

# defining variables to create the model 
rel_var <- c("x4","x2","Pos_Ser","Neg_Ser","Recomend","Vol","PC","Laptop",
             "Netb","Smart_Ph")

# cross validation
ctrl <- caret::trainControl(method = "repeatedcv",
                            number = 10,
                            repeats = 3)

# model training (seed set so the cross-validation folds are reproducible)
set.seed(123)
mod_5knn_caret <- caret::train(Vol ~ .,
                               method = "knn",
                               data = train[,rel_var],
                               trControl = ctrl, 
                               preProcess = c("center","scale"))

metric	train	test
RMSE	248.68	153.56
R^2	84.6 %	90.4 %
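
The rel_var vector above lists PC, Laptop, Netb and Smart_Ph, which read like 0/1 indicator columns for the four product types of interest. The deck does not show how they were built, so the following is only a sketch under assumptions: the mapping type_map is hypothetical, and the labels "Netbook" and "Smartphone" are assumed spellings of the ProductType values.

```r
# Toy product types; "Netbook" and "Smartphone" are assumed label spellings
toy <- data.frame(
  ProductType = c("PC", "Laptop", "Netbook", "Smartphone", "ExtendedWarranty"),
  stringsAsFactors = FALSE
)

# Hypothetical mapping from indicator column name to product type label
type_map <- c(PC = "PC", Laptop = "Laptop",
              Netb = "Netbook", Smart_Ph = "Smartphone")

# One 0/1 column per product type of interest; other types get all zeros
for (col in names(type_map)) {
  toy[[col]] <- as.integer(toy$ProductType == type_map[[col]])
}
print(toy)
```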

5th error check: error visualization knn

6th modeling: using Random Forest

set.seed(123)
mod_6rf <- caret::train(Vol ~ .,
                       method = "rf",
                       data = train[,rel_var],
                       trControl = ctrl)

metric	train	test
RMSE	102.03	84.67
R^2	97 %	98.03 %

6th error check: error visualization rf

7th modeling: using Support Vector Machine

set.seed(123)
mod_7svm <- caret::train(Vol ~ .,
                       method = "svmLinear",
                       data = train[,rel_var],
                       trControl = ctrl)

metric	train	test
RMSE	373.13	334.86
R^2	61.16 %	44.33 %
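
Collecting the reported metrics in one table makes the models easier to compare. The numbers below are copied from the tables in this deck (the linear model row is the 4th one); sorting by test RMSE is an assumption about the selection criterion. On these numbers the random forest has the lowest test RMSE, although the test set is small (about 20 % of at most 80 rows), so the differences should be read with care.

```r
# Metrics copied from the tables in this deck
results <- data.frame(
  model = c("lm (4th)", "knn", "rf", "svmLinear"),
  rmse_train = c(333.82, 248.68, 102.03, 373.13),
  rmse_test = c(241.01, 153.56, 84.67, 334.86),
  r2_train = c(65.46, 84.6, 97, 61.16),
  r2_test = c(69.35, 90.4, 98.03, 44.33),
  stringsAsFactors = FALSE
)

# Rank the models by test RMSE, lowest first (assumed selection criterion)
results <- results[order(results$rmse_test), ]
print(results, row.names = FALSE)

best_model <- results$model[1]
print(best_model)
```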