Sergi Calderón and Xènia Fabregat
In this task, Danielle asked us to predict sales volume for a list of new product types, based on historical sales data. For this task, Danielle wanted us to evaluate how product types could impact those sales. However, after performing some analysis we discover that product type has no significant impact on sales volume. This could be due to many factors: the sample is very small and this could skew our predictions, maybe the classification isn’t optimal (we don’t know very well according to which criteria product types are grouped: why are netbooks and laptops in two different groups? What do accessories include and could they be split?),…
This is why, we based our predictions on two other variables: 4 stars review and positive service review. Thanks to these we got the following predictions:
library(readr)
library(ggplot2)
predictiongraph <- read_csv("C:/Users/user/Documents/multiple regression/newproductw_prediction.csv")
net_sales <- (predictiongraph$predictedvolume * predictiongraph$Price)
predictiongraph <- cbind(net_sales, predictiongraph)
ggplot(predictiongraph, aes(y = net_sales, x = ProductType, fill = ProductType)) + geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
According to them it looks like we should invest in Game Consoles, Tablets, Netbooks and PCs. However, these predictions have a low condifence level because, as stated before, the size of the data set on which predictions are based is very small (only 80 observations). Therefore, we would suggest to use these predictions carefully taking this into account and, as a next step, performing again this analysis on a bigger sample.
# Load Libraries
library(caret)
library(corrplot)
library(party)
library(lattice)
library(data.table)
existing_product_attributes2017 <- read_csv("C:/Users/user/Downloads/existingproductattributes2017.csv")
existing_product_attributes2017$BestSellersRank <- NULL
new_product_attributes2017 <- read_csv("C:/Users/user/Downloads/newproductattributes2017.csv")
new_product_attributes2017$BestSellersRank <- NULL
To tackle this task, our first step was to understand and pre process the data we were given, and as Danielle was asking, put special focus on assessing whether Product Type impacted sales volume or not.
We can see that Best Seller rank has missing data, this is why we will dismiss this variable.
Once we have no missing data, we want to find correlations with the different variables, and to do so we will need to dummify our categorical variable: product type. Once this is done, we can spot the following correlations:
dummy_existing1 <- dummyVars(" ~ .", data = existing_product_attributes2017)
dummy_existing <- data.frame(predict(dummy_existing1, newdata = existing_product_attributes2017))
correlation_matrix <- cor(dummy_existing)
corrplot(correlation_matrix, tl.cex = 0.5)
Thanks to this correlation matrix we can see that there are 6 variables that have a high correlation with volume:
5 stars review
4 stars review
3 stars review
Positive service review
2 stars review
Product Type Game Console
However, we can see also that 5 stars review has a 100% correlation with volume, which is impossible and must be by chance, so we can’t use it. Also that 3 and 2 stars review are colinear with 4 stars review. This is why we will remove them and make sure we can use 4 stars review, Positive service review and Product Type Game Console.
dummyok <- subset(existing_product_attributes2017, select = c("ProductType", "Volume", "x4StarReviews", "PositiveServiceReview"))
dummyok$ProductType <- as.factor(dummyok$ProductType)
Out of these 3 remaining variables, we will inspect further the distribution of the variable with the weakest correlation, Product Type.
Let’s have a closer look at Product Type distribution. Indeed if we look at the boxplot distributions, Game console have huge volume, but that’s skewed due to the distribution of the data:
ggplot(dummyok, aes(x = Volume)) + geom_density() + facet_wrap(~ ProductType) # copied from David during the code review
ggplot(existing_product_attributes2017, aes(x= ProductType, y = Volume)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(existing_product_attributes2017, aes(x= ProductType,)) + geom_bar() + stat_count() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
We can see that Game Console has only 2 data points, which is why we can’t rely on it. Furthermore, if we look at all Product Types and volume, we can find no relationship:
summary(aov(Volume ~ ProductType, data = existing_product_attributes2017))
## Df Sum Sq Mean Sq F value Pr(>F)
## ProductType 11 38888822 3535347 1.683 0.0961 .
## Residuals 68 142859034 2100868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we perform an anova analysis we can see that the P value is too high, and this is why we decided not to use Product Type at all, and use only 4 stars reviews and positive service review.
dummyok <- subset(existing_product_attributes2017, select = c("Volume", "x4StarReviews", "PositiveServiceReview"))
To decide on the best model possible, we ran 4 different models (svm, knn, random forest and linear model). We saw that due to the sample size the seed used changed dramatically the performance of the models used, and this is why we tried these 4 models with 10 different seeds each:
for (i in 1:10) {
set.seed(i)
inTrain <- createDataPartition(y = dummyok$Volume, p = .80, list = FALSE)
training <- dummyok[inTrain,]
testing <- dummyok[-inTrain,]
models <- c("svmLinear", "knn", "rf", "lm")
resum <- c()
resumseed <- c()
for (j in models) {
fit <- train(Volume ~ ., data = training, method = j)
predictedvolume <- predict(fit, testing)
hola <- postResample(pred = predictedvolume, obs = testing$Volume)
resum <- cbind(hola, resum)
}
colnames(resum) <-(models)
resumseed <- rbind(resum, resumseed)
}
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
##
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
##
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
##
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
##
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
##
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
##
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
##
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
##
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
##
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
After doing so, we see that knn is the model that gives us a better performance:
resumseedok <- as.data.frame(melt(resumseed))
SVM <- subset.data.frame(resumseedok, resumseedok$Var2 == "svmLinear")
KNN <- subset.data.frame(resumseedok, resumseedok$Var2 == "knn")
RF <- subset.data.frame(resumseedok, resumseedok$Var2 == "rf")
LM <- subset.data.frame(resumseedok, resumseedok$Var2 == "lm")
resumerrormetrics <- cbind(tapply(SVM$value, SVM$Var1, mean), tapply(KNN$value, KNN$Var1, mean), tapply(RF$value, RF$Var1, mean), tapply(LM$value, LM$Var1, mean))
colnames(resumerrormetrics) <-(models)
resumerrormetrics
## svmLinear knn rf lm
## RMSE 653.2535981 244.2500631 331.5873744 463.0169290
## Rsquared 0.6311471 0.8234713 0.7173373 0.6234875
## MAE 358.7889570 108.7636232 184.6688215 213.2620457
And this is why we will use it to do our predictions:
fit <- train(Volume ~ ., data = training, method = "knn", importance=T)
predictedvolume <- round(predict(fit, new_product_attributes2017), digits = 0)
newproductw_prediction <- cbind(predictedvolume, new_product_attributes2017)
newproductw_prediction
## predictedvolume ProductType ProductNum Price x5StarReviews
## 1 315 PC 171 699.00 96
## 2 84 PC 172 860.00 51
## 3 82 Laptop 173 1199.00 74
## 4 21 Laptop 175 1199.00 7
## 5 12 Laptop 176 1999.00 1
## 6 60 Netbook 178 399.99 19
## 7 837 Netbook 180 329.00 312
## 8 133 Netbook 181 439.00 23
## 9 28 Netbook 183 330.00 3
## 10 783 Tablet 186 629.00 296
## 11 1967 Tablet 187 199.00 943
## 12 254 Smartphone 193 199.00 99
## 13 456 Smartphone 194 49.00 100
## 14 67 Smartphone 195 149.00 42
## 15 154 Smartphone 196 300.00 50
## 16 837 GameConsole 199 249.99 462
## 17 13 Display 201 140.00 4
## 18 23 Accessories 301 20.99 30
## 19 21 Accessories 302 8.50 25
## 20 133 Software 303 70.99 29
## 21 67 Printer 304 199.99 88
## 22 13 PrinterSupplies 305 20.99 5
## 23 12 ExtendedWarranty 306 99.99 0
## 24 1967 GameConsole 307 425.00 1525
## x4StarReviews x3StarReviews x2StarReviews x1StarReviews
## 1 26 14 14 25
## 2 11 10 10 21
## 3 10 3 3 11
## 4 2 1 1 1
## 5 1 1 3 0
## 6 8 4 1 10
## 7 112 28 31 47
## 8 18 7 22 18
## 9 4 0 1 0
## 10 66 30 21 36
## 11 437 224 160 247
## 12 26 12 16 35
## 13 26 37 33 48
## 14 8 4 4 9
## 15 19 13 20 22
## 16 97 25 17 58
## 17 0 0 0 2
## 18 1 5 0 0
## 19 2 2 4 15
## 20 18 3 1 8
## 21 8 3 1 3
## 22 0 0 0 0
## 23 1 1 1 1
## 24 252 99 56 45
## PositiveServiceReview NegativeServiceReview Recommendproduct
## 1 12 3 0.7
## 2 7 5 0.6
## 3 11 5 0.8
## 4 2 1 0.6
## 5 0 1 0.3
## 6 2 4 0.6
## 7 28 16 0.7
## 8 5 16 0.4
## 9 1 0 0.7
## 10 28 9 0.8
## 11 90 23 0.8
## 12 8 6 0.4
## 13 14 6 0.6
## 14 4 1 0.7
## 15 5 7 0.6
## 16 32 12 0.8
## 17 1 1 0.7
## 18 2 0 0.9
## 19 2 1 0.5
## 20 4 2 0.8
## 21 5 1 0.8
## 22 1 0 1.0
## 23 0 3 0.4
## 24 59 13 0.9
## ShippingWeight ProductDepth ProductWidth ProductHeight ProfitMargin
## 1 19.90 20.63 19.25 8.39 0.25
## 2 27.00 21.89 27.01 9.13 0.20
## 3 6.60 8.94 12.80 0.68 0.10
## 4 13.00 16.30 10.80 1.40 0.15
## 5 11.60 16.81 10.90 0.88 0.23
## 6 5.80 8.43 11.42 1.20 0.08
## 7 4.60 10.17 7.28 0.95 0.09
## 8 4.80 8.00 11.70 1.50 0.11
## 9 4.30 7.40 10.40 0.97 0.09
## 10 3.00 7.31 9.50 0.37 0.10
## 11 0.90 5.40 7.60 0.40 0.20
## 12 0.90 2.70 5.20 0.40 0.11
## 13 0.70 2.67 5.33 0.37 0.12
## 14 0.80 2.70 5.30 0.40 0.15
## 15 0.90 2.60 5.00 0.40 0.11
## 16 8.40 6.20 13.20 13.20 0.09
## 17 8.90 13.60 17.60 7.30 0.05
## 18 0.75 10.70 13.10 0.60 0.05
## 19 1.00 7.30 7.00 1.60 0.10
## 20 0.20 8.00 7.00 1.00 0.20
## 21 42.00 17.30 23.50 25.80 0.90
## 22 1.00 4.70 2.90 6.30 0.30
## 23 0.20 0.00 0.00 0.00 0.40
## 24 20.00 8.50 6.00 1.75 0.18
## Volume
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
## 11 0
## 12 0
## 13 0
## 14 0
## 15 0
## 16 0
## 17 0
## 18 0
## 19 0
## 20 0
## 21 0
## 22 0
## 23 0
## 24 0
ggplot(predictiongraph, aes(y = predictedvolume, x = ProductType, fill = ProductType)) + geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
write.csv(newproductw_prediction, file="newproductw_prediction.csv", row.names = TRUE)