Multiple Regression in R

Executive Summary

Introduction

We were asked to conduct further analysis following an earlier data-mining exercise, in which we used RapidMiner to predict sales volumes and the resulting profitability for a shortlist of products under consideration as additions to Blackwell’s selection.

Specifically, we were requested to pick up where we left off by providing a more elaborate analysis of the same data. We were asked to explore what influence, if any, the respective product types have on the prediction of sales volumes. Furthermore, we were requested to:

  1. Predict sales volumes of the following product types:
     1. PCs;
     2. Laptops;
     3. Netbooks; and
     4. Smartphones.
  2. Assess the impact that service and customer satisfaction reviews have on sales volumes.

The purpose of this summary is to provide the reader with an overview of our key findings; accordingly, it consists of appropriately weighted conclusions. We refer the reader to the Technical Report appended hereto for technical documentation and argumentation.

About the data

We carried out our analysis on the same historical data that served as the basis for our earlier projections for the envisaged new products. Consequently, the dataset is subject to the same limitations as it was when we predicted the sales volumes of the new products.

The data consisted of 80 observations of 18 attributes concerning products sold by Blackwell. Our tests show that the most influential attributes for predicting sales were those related to customer sentiment, as evidenced in the Technical Report.

We note that the data is unevenly distributed and contained a few abnormalities. There were also a few obvious errors from the data-collection phase, as well as missing values, which forced us to delete a number of observations and one attribute; given the sparsity of the dataset, we would have preferred to avoid this.

Results

We began by analyzing the influence of the attribute “ProductType” on the prediction of sales volume, using an analysis of variance (ANOVA). The analysis showed that this attribute has no statistically significant relationship with sales volume.

This was a multiple-regression task. We trained a number of supervised-learning algorithms and obtained the best results with a decision-tree-based ensemble algorithm called Random Forest. Training yielded a model that explains approximately 92% of the variance in sales volume (R² ≈ 0.92).

Prediction

The projected sales volumes for the product types are presented in the following table.

ProductType  ProductNum  Brand     Volume
Netbook      180         Acer        1191
Smartphone   194         Samsung      899
Smartphone   193         Motorola     261
PC           171         Dell         217
Laptop       173         Apple        177
Netbook      181         Asus         159
PC           172         Dell         104
Smartphone   196         Motorola      95
Smartphone   195         HTC           84
Netbook      178         HP            62
Laptop       175         Toshiba       54
Netbook      183         Samsung       43
Laptop       176         Razer         19

Discussion

As both the data and the algorithms available to us were the same as in the previous exercise, this analysis produced no material differences in results and no material gains in knowledge. The central take-away is that, in relation to this dataset, product type carries little weight in predicting sales.

Visualization and data exploration

We can infer the following from a visual exploration of the dataset (a sketch of how such an exploration could be reproduced follows this list):

  1. x5StarReviews has a suspiciously strong positive correlation with Volume;
  2. x4StarReviews has a very strong positive correlation with Volume;
  3. all star-review attributes are positively correlated with Volume, even the 1- and 2-star reviews, which are generally considered expressions of negative customer sentiment. This suggests that it is the number of reviews rather than their rating that is related to Volume; and
  4. attributes relating to physical or financial aspects of the products are largely irrelevant for the purposes of inferential statistics.
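
A minimal sketch of how this exploration could be reproduced in R; the file name and the object name existing are assumptions, not taken from the original script.

```r
# Load the historical product data (file name is an assumption)
library(corrplot)
library(ggplot2)

existing <- read.csv("existingproductattributes2017.csv", stringsAsFactors = FALSE)

# Correlation structure of the numeric attributes
num_vars <- existing[sapply(existing, is.numeric)]
corrplot(cor(num_vars, use = "pairwise.complete.obs"),
         type = "upper", tl.cex = 0.7)

# Star reviews against Volume, colored by product type
ggplot(existing, aes(x = x4StarReviews, y = Volume, color = ProductType)) +
  geom_point() +
  labs(title = "4-star reviews vs. sales volume")
```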

Data preprocessing
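
The overview below could have been produced along the following lines; this is a sketch that continues with the assumed data frame existing, and the removal of a handful of obviously erroneous observations mentioned in the summary is not reproduced here.

```r
# Summary of the cleaned-up historical data (the output shown below)
summary(existing)

# Missing values per attribute; only BestSellersRank is affected
colSums(is.na(existing))
```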

 ProductType          ProductNum      Price         x5StarReviews   
 Length:73          Min.   :101   Min.   :   3.60   Min.   :   0.0  
 Class :character   1st Qu.:119   1st Qu.:  49.99   1st Qu.:   8.0  
 Mode  :character   Median :144   Median : 129.00   Median :  29.0  
                    Mean   :143   Mean   : 253.06   Mean   : 163.6  
                    3rd Qu.:162   3rd Qu.: 379.99   3rd Qu.: 164.0  
                    Max.   :200   Max.   :2249.99   Max.   :2801.0  
                                                                    
 x4StarReviews    x3StarReviews    x2StarReviews    x1StarReviews    
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :   0.00  
 1st Qu.:  2.00   1st Qu.:  2.00   1st Qu.:  1.00   1st Qu.:   2.00  
 Median : 17.00   Median :  6.00   Median :  3.00   Median :   6.00  
 Mean   : 41.47   Mean   : 15.44   Mean   : 14.82   Mean   :  39.85  
 3rd Qu.: 37.00   3rd Qu.: 12.00   3rd Qu.:  8.00   3rd Qu.:  16.00  
 Max.   :431.00   Max.   :162.00   Max.   :370.00   Max.   :1654.00  
                                                                     
 PositiveServiceReview NegativeServiceReview Recommendproduct
 Min.   :  0.00        Min.   :  0.000       Min.   :0.1000  
 1st Qu.:  2.00        1st Qu.:  1.000       1st Qu.:0.7000  
 Median :  5.00        Median :  2.000       Median :0.8000  
 Mean   : 29.86        Mean   :  6.055       Mean   :0.7301  
 3rd Qu.: 13.00        3rd Qu.:  4.000       3rd Qu.:0.9000  
 Max.   :536.00        Max.   :112.000       Max.   :1.0000  
                                                             
 BestSellersRank    ShippingWeight   ProductDepth     ProductWidth   
 Min.   :    1.00   Min.   : 0.01   Min.   :  0.00   Min.   : 0.000  
 1st Qu.:   10.25   1st Qu.: 0.90   1st Qu.:  5.90   1st Qu.: 3.000  
 Median :   49.00   Median : 2.40   Median :  8.80   Median : 7.600  
 Mean   : 1261.40   Mean   :10.58   Mean   : 15.81   Mean   : 8.569  
 3rd Qu.:  386.50   3rd Qu.:13.00   3rd Qu.: 15.50   3rd Qu.:11.700  
 Max.   :17502.00   Max.   :63.00   Max.   :300.00   Max.   :31.750  
 NA's   :15                                                          
 ProductHeight     ProfitMargin       Volume       
 Min.   : 0.000   Min.   :0.050   Min.   :    0.0  
 1st Qu.: 0.600   1st Qu.:0.050   1st Qu.:   32.0  
 Median : 5.200   Median :0.110   Median :  116.0  
 Mean   : 6.859   Mean   :0.131   Mean   :  654.5  
 3rd Qu.:11.500   3rd Qu.:0.170   3rd Qu.:  656.0  
 Max.   :25.800   Max.   :0.400   Max.   :11204.0  
                                                   

We can observe that there are missing values in the “BestSellersRank” column. As we cannot predict or recover the values of this column from the other features, we remove the attribute. We also remove the two largest outliers, corresponding to volumes of 11204 and 7036, as they have a disproportionate impact on the model.
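
A minimal sketch of these preprocessing steps, continuing with the assumed data frame existing:

```r
# Drop the attribute whose missing values cannot be recovered
existing$BestSellersRank <- NULL

# Remove the two extreme outliers identified above (volumes 11204 and 7036)
existing <- existing[!existing$Volume %in% c(11204, 7036), ]
```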

            Df   Sum Sq Mean Sq F value Pr(>F)
ProductType 11  5075621  461420   1.473  0.166
Residuals   59 18487579  313349               
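
The ANOVA table above can be reproduced roughly as follows; a sketch, in which we also set ProductType aside, as described in the next paragraph.

```r
# One-way ANOVA: does ProductType explain variation in Volume?
volume_aov <- aov(Volume ~ ProductType, data = existing)
summary(volume_aov)

# Keep the product type as a plain vector for later plots and drop the
# categorical attribute from the modeling data
product_type <- existing$ProductType
existing$ProductType <- NULL
```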

The p-value (0.166) is well above the conventional 0.05 threshold, so we cannot conclude that the categorical variable ProductType is related to the dependent variable (Volume). We therefore do not include ProductType in our model, but we store it as a vector for later use (plots).

Feature Selection

[1] "x5StarReviews"         "Volume"                "x3StarReviews"        
[4] "x2StarReviews"         "NegativeServiceReview"
   Variable1              Variable2   AbsCor      Cor
1  x5StarReviews          Volume      1.0000   1.0000
8  x4StarReviews          Volume      0.8041   0.8041
12 x3StarReviews          Volume      0.6864   0.6864
18 PositiveServiceReview  Volume      0.5658   0.5658
22 x2StarReviews          Volume      0.5159   0.5159
25 NegativeServiceReview  Volume      0.4997   0.4997
28 x1StarReviews          Volume      0.4124   0.4124
34 ProductDepth           Volume      0.3203   0.3203
38 ShippingWeight         Volume      0.2690  -0.2690
45 Recommendproduct       Volume      0.1834   0.1834
49 Price                  Volume      0.1742  -0.1742
51 ProfitMargin           Volume      0.1681  -0.1681
69 ProductNum             Volume      0.0989   0.0989
78 ProductWidth           Volume      0.0900  -0.0900
97 ProductHeight          Volume      0.0421  -0.0421
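
A sketch of how a correlation listing like the one above could be built; the helper names are ours, and the findCorrelation cutoff is a guess rather than a value taken from the original analysis.

```r
library(caret)     # for findCorrelation()
library(reshape2)  # for melt()

# Full pairwise correlation matrix of the (now all-numeric) attributes
corr_mat <- cor(existing)

# Attributes flagged as highly inter-correlated (cutoff is an assumption)
findCorrelation(corr_mat, cutoff = 0.8, names = TRUE)

# Long-format listing of each attribute's correlation with Volume
corr_long <- subset(melt(corr_mat, varnames = c("Variable1", "Variable2"),
                         value.name = "Cor"),
                    Variable1 != Variable2 & Variable2 == "Volume")
corr_long$AbsCor <- abs(corr_long$Cor)
corr_long[order(-corr_long$AbsCor), c("Variable1", "Variable2", "AbsCor", "Cor")]
```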

Here we can observe that the review attributes are highly correlated with each other. Consequently, we remove x3StarReviews and x1StarReviews, as they are highly correlated with x4StarReviews and x2StarReviews respectively. Furthermore, we plot a decision tree to see which variables have the biggest impact.
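
The decision tree referred to above could be produced with, for example, rpart. This is a sketch; leaving out x5StarReviews is our assumption, motivated by its suspicious perfect correlation with Volume.

```r
library(rpart)
library(rpart.plot)

# Regression tree used only to gauge which attributes drive Volume;
# x5StarReviews is excluded because of its suspicious 1.0 correlation
tree_fit <- rpart(Volume ~ . - x5StarReviews, data = existing, method = "anova")
rpart.plot(tree_fit)

# Numeric view of the same information
tree_fit$variable.importance
```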

We can observe that the variables with the biggest impact are x4StarReviews and PositiveServiceReview, which are also among the attributes most strongly correlated with Volume.

Next, we create a new variable from the star-review attributes. We do this by fitting a linear regression and constructing an “average weighted star review” from the coefficients of the regression model and their respective variables.
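
A sketch of how such a variable could be constructed; the name Avg_WghtStar matches the output further below, while the exact weighting scheme is an assumption.

```r
# Regress Volume on the four star-review counts
star_lm <- lm(Volume ~ x4StarReviews + x3StarReviews + x2StarReviews + x1StarReviews,
              data = existing)
summary(star_lm)$coefficients   # the coefficient table shown below

# Weighted combination of the star reviews using the fitted coefficients
w <- coef(star_lm)[-1]          # drop the intercept
existing$Avg_WghtStar <- as.numeric(as.matrix(existing[names(w)]) %*% w)
```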

                 Estimate Std. Error     t value     Pr(>|t|)
(Intercept)    92.8996017  50.904372  1.82498276 7.252930e-02
x4StarReviews  14.2183632   2.572673  5.52668819 5.977552e-07
x3StarReviews -14.7529735   9.979575 -1.47831683 1.440798e-01
x2StarReviews   3.2275066  12.659522  0.25494696 7.995568e-01
x1StarReviews   0.0484845   2.178431  0.02225662 9.823104e-01
   Variable1      Variable2     AbsCor    Cor
10 Volume         Avg_WghtStar  0.8163 0.8163

   Variable1      Variable2     AbsCor    Cor
12 x4StarReviews  Volume        0.8041 0.8041

Here we can observe that the new variable has a higher correlation with Volume than x4StarReviews, which was previously the candidate feature with the strongest relationship to Volume.

Modeling

We create a loop that trains several models on several candidate feature sets.
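
A sketch of such a loop built on caret; the resampling settings and seed are assumptions, and the svm and rf methods additionally require the kernlab and randomForest packages.

```r
library(caret)
set.seed(123)

formulas <- list(Volume ~ Avg_WghtStar + PositiveServiceReview,
                 Volume ~ x4StarReviews + PositiveServiceReview)
methods  <- c("glm", "svmRadial", "svmLinear", "knn", "rf")
ctrl     <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

results <- list()
for (f in formulas) {
  for (m in methods) {
    fit <- train(f, data = existing, method = m, trControl = ctrl)
    # Keep the resampled metrics of the best tuning parameters
    results[[paste(deparse(f), m)]] <-
      fit$results[which.min(fit$results$RMSE), c("RMSE", "Rsquared", "MAE")]
  }
}
do.call(rbind, results)
```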

The performance metrics of each combination of feature set and algorithm are summarized below; the lowest mean absolute error, 123.048, corresponds to the Random Forest model trained on Avg_WghtStar and PositiveServiceReview.

Formula                                          Method     RMSE         Rsquared   MAE
Volume ~ Avg_WghtStar + PositiveServiceReview    glm        359.7378551  0.7569293  197.1320486
Volume ~ Avg_WghtStar + PositiveServiceReview    svmRadial  417.5630488  0.7671428  248.5718783
Volume ~ Avg_WghtStar + PositiveServiceReview    svmLinear  349.2470074  0.7995331  170.4095309
Volume ~ Avg_WghtStar + PositiveServiceReview    knn        398.5383294  0.7991097  227.7000000
Volume ~ Avg_WghtStar + PositiveServiceReview    rf         218.6926097  0.9215047  123.0479857
Volume ~ x4StarReviews + PositiveServiceReview   glm        395.2775566  0.6851846  233.7248491
Volume ~ x4StarReviews + PositiveServiceReview   svmRadial  432.1760144  0.7425961  259.8334941
Volume ~ x4StarReviews + PositiveServiceReview   svmLinear  377.3323953  0.7309845  187.7202784
Volume ~ x4StarReviews + PositiveServiceReview   knn        249.506351   0.939719    134.294048
Volume ~ x4StarReviews + PositiveServiceReview   rf         219.9479104  0.9186288  125.8509355

As we can see from the plots, the relative error is at its largest at low volumes, which is nearly always the case, since the relative error is the absolute error expressed as a fraction of the observed value. The absolute error, by contrast, is greater at greater volumes.
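
For reference, a sketch of how the errors behind those plots could be computed, assuming the selected model is the Random Forest trained on Avg_WghtStar and PositiveServiceReview and reusing the ctrl object from the training loop above.

```r
# Refit the chosen model (Random Forest on the engineered feature set)
rf_fit <- train(Volume ~ Avg_WghtStar + PositiveServiceReview,
                data = existing, method = "rf", trControl = ctrl)

# Absolute and relative error against the observed volumes
pred      <- predict(rf_fit, newdata = existing)
abs_error <- abs(pred - existing$Volume)
nonzero   <- existing$Volume > 0              # relative error is undefined at 0
rel_error <- abs_error[nonzero] / existing$Volume[nonzero]

plot(existing$Volume[nonzero], rel_error, log = "x",
     xlab = "Observed volume", ylab = "Relative error")
```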

Volume of the new products
ProductType  ProductNum  Volume
PC           171            217
PC           172            104
Laptop       173            177
Laptop       175             54
Laptop       176             19
Netbook      178             62
Netbook      180           1191
Netbook      181            159
Netbook      183             43
Smartphone   193            261
Smartphone   194            899
Smartphone   195             84
Smartphone   196             95
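
These figures could have been generated along the following lines; the file name and the re-creation of Avg_WghtStar for the new data are assumptions.

```r
# Attributes of the candidate new products (file name is an assumption)
newproducts <- read.csv("newproductattributes2017.csv", stringsAsFactors = FALSE)

# The engineered feature has to be built for the new data in the same way
w <- coef(star_lm)[-1]
newproducts$Avg_WghtStar <- as.numeric(as.matrix(newproducts[names(w)]) %*% w)

# Predict the volumes with the chosen Random Forest model and round to units
newproducts$Volume <- round(predict(rf_fit, newdata = newproducts))
newproducts[order(newproducts$ProductNum), c("ProductType", "ProductNum", "Volume")]
```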

Elias Smouni, Joël Ribera Zaragoza

2019-07-18