Sergi Calderón and Xènia Fabregat

Executive summary

In this task, Danielle asked us to predict sales volume for a list of new product types, based on historical sales data. For this task, Danielle wanted us to evaluate how product types could impact those sales. However, after performing some analysis we discover that product type has no significant impact on sales volume. This could be due to many factors: the sample is very small and this could skew our predictions, maybe the classification isn’t optimal (we don’t know very well according to which criteria product types are grouped: why are netbooks and laptops in two different groups? What do accessories include and could they be split?),…

This is why, we based our predictions on two other variables: 4 stars review and positive service review. Thanks to these we got the following predictions:

library(readr)
library(ggplot2)

predictiongraph <- read_csv("C:/Users/user/Documents/multiple regression/newproductw_prediction.csv")
net_sales <- (predictiongraph$predictedvolume * predictiongraph$Price)
predictiongraph <- cbind(net_sales, predictiongraph)
ggplot(predictiongraph, aes(y = net_sales, x = ProductType, fill = ProductType)) + geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

According to them it looks like we should invest in Game Consoles, Tablets, Netbooks and PCs. However, these predictions have a low condifence level because, as stated before, the size of the data set on which predictions are based is very small (only 80 observations). Therefore, we would suggest to use these predictions carefully taking this into account and, as a next step, performing again this analysis on a bigger sample.

Technical appendix

# Load Libraries

library(caret)
library(corrplot)
library(party)
library(lattice)
library(data.table)
existing_product_attributes2017 <- read_csv("C:/Users/user/Downloads/existingproductattributes2017.csv")
existing_product_attributes2017$BestSellersRank <- NULL
new_product_attributes2017 <- read_csv("C:/Users/user/Downloads/newproductattributes2017.csv")
new_product_attributes2017$BestSellersRank <- NULL

Data understanding & Pre processing

To tackle this task, our first step was to understand and pre process the data we were given, and as Danielle was asking, put special focus on assessing whether Product Type impacted sales volume or not.

Missing data

We can see that Best Seller rank has missing data, this is why we will dismiss this variable.

Categorical to binary data conversion

Once we have no missing data, we want to find correlations with the different variables, and to do so we will need to dummify our categorical variable: product type. Once this is done, we can spot the following correlations:

dummy_existing1 <- dummyVars(" ~ .", data = existing_product_attributes2017)
dummy_existing <- data.frame(predict(dummy_existing1, newdata = existing_product_attributes2017))
correlation_matrix <- cor(dummy_existing)
corrplot(correlation_matrix, tl.cex = 0.5)

Feature selection

Thanks to this correlation matrix we can see that there are 6 variables that have a high correlation with volume:

  1. 5 stars review

  2. 4 stars review

  3. 3 stars review

  4. Positive service review

  5. 2 stars review

  6. Product Type Game Console

However, we can see also that 5 stars review has a 100% correlation with volume, which is impossible and must be by chance, so we can’t use it. Also that 3 and 2 stars review are colinear with 4 stars review. This is why we will remove them and make sure we can use 4 stars review, Positive service review and Product Type Game Console.

dummyok <- subset(existing_product_attributes2017, select = c("ProductType", "Volume", "x4StarReviews", "PositiveServiceReview"))
dummyok$ProductType <- as.factor(dummyok$ProductType)

Out of these 3 remaining variables, we will inspect further the distribution of the variable with the weakest correlation, Product Type.

Let’s have a closer look at Product Type distribution. Indeed if we look at the boxplot distributions, Game console have huge volume, but that’s skewed due to the distribution of the data:

ggplot(dummyok, aes(x = Volume)) + geom_density() + facet_wrap(~ ProductType) # copied from David during the code review

ggplot(existing_product_attributes2017, aes(x= ProductType, y = Volume)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(existing_product_attributes2017, aes(x= ProductType,)) + geom_bar() + stat_count() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

We can see that Game Console has only 2 data points, which is why we can’t rely on it. Furthermore, if we look at all Product Types and volume, we can find no relationship:

summary(aov(Volume ~ ProductType, data = existing_product_attributes2017))
##             Df    Sum Sq Mean Sq F value Pr(>F)  
## ProductType 11  38888822 3535347   1.683 0.0961 .
## Residuals   68 142859034 2100868                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we perform an anova analysis we can see that the P value is too high, and this is why we decided not to use Product Type at all, and use only 4 stars reviews and positive service review.

dummyok <- subset(existing_product_attributes2017, select = c("Volume", "x4StarReviews", "PositiveServiceReview"))

Modeling

To decide on the best model possible, we ran 4 different models (svm, knn, random forest and linear model). We saw that due to the sample size the seed used changed dramatically the performance of the models used, and this is why we tried these 4 models with 10 different seeds each:

for (i in 1:10) {
  
set.seed(i)
inTrain <- createDataPartition(y = dummyok$Volume, p = .80, list = FALSE)
training <- dummyok[inTrain,]
testing <- dummyok[-inTrain,]

models <- c("svmLinear", "knn", "rf", "lm")

resum <- c()
resumseed <- c()

    for (j in models) {
      
      fit <- train(Volume ~ ., data = training, method = j)
      predictedvolume <- predict(fit, testing)
      hola <- postResample(pred = predictedvolume, obs = testing$Volume)
      resum <- cbind(hola, resum)
      
      
    }

colnames(resum) <-(models)
resumseed <- rbind(resum, resumseed)

}
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
## 
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
## 
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
## 
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
## 
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
## 
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
## 
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
## 
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
## 
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
## 
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

After doing so, we see that knn is the model that gives us a better performance:

resumseedok <- as.data.frame(melt(resumseed))

SVM <- subset.data.frame(resumseedok, resumseedok$Var2 == "svmLinear")
KNN <- subset.data.frame(resumseedok, resumseedok$Var2 == "knn")
RF <- subset.data.frame(resumseedok, resumseedok$Var2 == "rf")
LM <- subset.data.frame(resumseedok, resumseedok$Var2 == "lm")

resumerrormetrics <- cbind(tapply(SVM$value, SVM$Var1, mean), tapply(KNN$value, KNN$Var1, mean), tapply(RF$value, RF$Var1, mean), tapply(LM$value, LM$Var1, mean))

colnames(resumerrormetrics) <-(models)

resumerrormetrics
##            svmLinear         knn          rf          lm
## RMSE     653.2535981 244.2500631 331.5873744 463.0169290
## Rsquared   0.6311471   0.8234713   0.7173373   0.6234875
## MAE      358.7889570 108.7636232 184.6688215 213.2620457

And this is why we will use it to do our predictions:

fit <- train(Volume ~ ., data = training, method = "knn", importance=T)
predictedvolume <- round(predict(fit, new_product_attributes2017), digits = 0)
newproductw_prediction <- cbind(predictedvolume, new_product_attributes2017)
newproductw_prediction
##    predictedvolume      ProductType ProductNum   Price x5StarReviews
## 1              315               PC        171  699.00            96
## 2               84               PC        172  860.00            51
## 3               82           Laptop        173 1199.00            74
## 4               21           Laptop        175 1199.00             7
## 5               12           Laptop        176 1999.00             1
## 6               60          Netbook        178  399.99            19
## 7              837          Netbook        180  329.00           312
## 8              133          Netbook        181  439.00            23
## 9               28          Netbook        183  330.00             3
## 10             783           Tablet        186  629.00           296
## 11            1967           Tablet        187  199.00           943
## 12             254       Smartphone        193  199.00            99
## 13             456       Smartphone        194   49.00           100
## 14              67       Smartphone        195  149.00            42
## 15             154       Smartphone        196  300.00            50
## 16             837      GameConsole        199  249.99           462
## 17              13          Display        201  140.00             4
## 18              23      Accessories        301   20.99            30
## 19              21      Accessories        302    8.50            25
## 20             133         Software        303   70.99            29
## 21              67          Printer        304  199.99            88
## 22              13  PrinterSupplies        305   20.99             5
## 23              12 ExtendedWarranty        306   99.99             0
## 24            1967      GameConsole        307  425.00          1525
##    x4StarReviews x3StarReviews x2StarReviews x1StarReviews
## 1             26            14            14            25
## 2             11            10            10            21
## 3             10             3             3            11
## 4              2             1             1             1
## 5              1             1             3             0
## 6              8             4             1            10
## 7            112            28            31            47
## 8             18             7            22            18
## 9              4             0             1             0
## 10            66            30            21            36
## 11           437           224           160           247
## 12            26            12            16            35
## 13            26            37            33            48
## 14             8             4             4             9
## 15            19            13            20            22
## 16            97            25            17            58
## 17             0             0             0             2
## 18             1             5             0             0
## 19             2             2             4            15
## 20            18             3             1             8
## 21             8             3             1             3
## 22             0             0             0             0
## 23             1             1             1             1
## 24           252            99            56            45
##    PositiveServiceReview NegativeServiceReview Recommendproduct
## 1                     12                     3              0.7
## 2                      7                     5              0.6
## 3                     11                     5              0.8
## 4                      2                     1              0.6
## 5                      0                     1              0.3
## 6                      2                     4              0.6
## 7                     28                    16              0.7
## 8                      5                    16              0.4
## 9                      1                     0              0.7
## 10                    28                     9              0.8
## 11                    90                    23              0.8
## 12                     8                     6              0.4
## 13                    14                     6              0.6
## 14                     4                     1              0.7
## 15                     5                     7              0.6
## 16                    32                    12              0.8
## 17                     1                     1              0.7
## 18                     2                     0              0.9
## 19                     2                     1              0.5
## 20                     4                     2              0.8
## 21                     5                     1              0.8
## 22                     1                     0              1.0
## 23                     0                     3              0.4
## 24                    59                    13              0.9
##    ShippingWeight ProductDepth ProductWidth ProductHeight ProfitMargin
## 1           19.90        20.63        19.25          8.39         0.25
## 2           27.00        21.89        27.01          9.13         0.20
## 3            6.60         8.94        12.80          0.68         0.10
## 4           13.00        16.30        10.80          1.40         0.15
## 5           11.60        16.81        10.90          0.88         0.23
## 6            5.80         8.43        11.42          1.20         0.08
## 7            4.60        10.17         7.28          0.95         0.09
## 8            4.80         8.00        11.70          1.50         0.11
## 9            4.30         7.40        10.40          0.97         0.09
## 10           3.00         7.31         9.50          0.37         0.10
## 11           0.90         5.40         7.60          0.40         0.20
## 12           0.90         2.70         5.20          0.40         0.11
## 13           0.70         2.67         5.33          0.37         0.12
## 14           0.80         2.70         5.30          0.40         0.15
## 15           0.90         2.60         5.00          0.40         0.11
## 16           8.40         6.20        13.20         13.20         0.09
## 17           8.90        13.60        17.60          7.30         0.05
## 18           0.75        10.70        13.10          0.60         0.05
## 19           1.00         7.30         7.00          1.60         0.10
## 20           0.20         8.00         7.00          1.00         0.20
## 21          42.00        17.30        23.50         25.80         0.90
## 22           1.00         4.70         2.90          6.30         0.30
## 23           0.20         0.00         0.00          0.00         0.40
## 24          20.00         8.50         6.00          1.75         0.18
##    Volume
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 7       0
## 8       0
## 9       0
## 10      0
## 11      0
## 12      0
## 13      0
## 14      0
## 15      0
## 16      0
## 17      0
## 18      0
## 19      0
## 20      0
## 21      0
## 22      0
## 23      0
## 24      0
ggplot(predictiongraph, aes(y = predictedvolume, x = ProductType, fill = ProductType)) + geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

write.csv(newproductw_prediction, file="newproductw_prediction.csv", row.names = TRUE)