Multiple Regression in R
Load packages and datasets
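The loading chunk itself is not rendered here; a minimal sketch of the setup, assuming the libraries used later in this report and hypothetical file names for the data exports:

```r
# Libraries used throughout the analysis
library(caret)    # train(), createDataPartition(), findCorrelation(), postResample()
library(ggplot2)  # exploratory plots
library(party)    # ctree() decision trees
library(knitr)    # kable() tables

# File names are assumptions; substitute the actual data exports
products <- read.csv("existingproductattributes.csv", stringsAsFactors = FALSE)
newproducts <- read.csv("newproductattributes.csv", stringsAsFactors = FALSE)
```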
Executive Summary
Introduction
We were asked to conduct further analysis building on an earlier data mining project, in which we used RapidMiner to predict sales volumes, and the resulting profitability, for a shortlist of products under consideration as additions to Blackwell's assortment.
Specifically, we were asked to pick up where we left off by providing a more elaborate analysis of the same data. For this analysis, we were to explore what influence, if any, the respective product types have on projected sales volumes. Furthermore, we were requested to:
- Predict sales volumes of the following product types:
  - PC;
  - Laptops;
  - Netbooks; and
  - Smartphones.
- Assess the impact that service reviews and customer satisfaction have on sales volumes.
The purpose of this summary is to give the reader an overview of our key findings, and it consists of appropriately weighted conclusions. For technical documentation and argumentation, we refer you to the Technical Report appended hereto.
About the data
We performed our analysis on the same historical data that served as the basis for our earlier projections for the envisaged new products. Consequently, the dataset imposes the same limitations as before on predicting the sales volumes of new products.
The data consisted of 80 observations of 18 attributes of products sold by Blackwell. As evidenced in the Technical Report, our tests show that the attributes related to customer sentiment are the most influential in predicting sales.
We note that the data is unevenly distributed and contains a few abnormalities. There were also some obvious errors from the data collection phase, as well as missing values, which forced us to delete a number of observations and one attribute; given how small the dataset is, we would have preferred to avoid this.
Results
We began by analyzing the influence of the attribute “ProductType” on predicted sales volume using an analysis of variance (ANOVA). The analysis showed that the attribute has no statistically significant influence on sales volume.
This was a multiple-regression task. We trained a number of supervised-learning algorithms and obtained the best results with Random Forest, a decision-tree ensemble. The resulting model explained roughly 92% of the variance in sales volume (R² ≈ 0.92) on held-out data.
Prediction
The projected sales volumes for the product types are presented in the following table.
| ProductType | ProductNum | Brand | Volume |
|---|---|---|---|
| Netbook | 180 | Acer | 1191 |
| Smartphone | 194 | Samsung | 899 |
| Smartphone | 193 | Motorola | 261 |
| PC | 171 | Dell | 217 |
| Laptop | 173 | Apple | 177 |
| Netbook | 181 | Asus | 159 |
| PC | 172 | Dell | 104 |
| Smartphone | 196 | Motorola | 95 |
| Smartphone | 195 | HTC | 84 |
| Netbook | 178 | HP | 62 |
| Laptop | 175 | Toshiba | 54 |
| Netbook | 183 | Samsung | 43 |
| Laptop | 176 | Razer | 19 |
Discussion
As both the data and the algorithms available to us were the same as last time, this exercise produced no material difference in results and no material knowledge gains. The central take-away is that, for this dataset, product type carries little weight in predicting sales.
Visualization and data exploration:
```r
# Jitter-plot each attribute against Volume (columns 3:11 and 13:18 are the
# numeric attributes; column 12, BestSellersRank, is skipped)
plotprod <- products[, c(3:11, 13:18)]
for (i in names(plotprod[, -which(names(plotprod) == "Volume")])) {
    print(ggplot(data = plotprod, aes_string(x = i, y = "Volume")) +
        geom_jitter(color = "darkred") + ylab("Volume"))
}
```
We can infer the following from the visual exploration of the dataset alone:
- x5StarReviews has a suspiciously strong positive correlation with Volume;
- x4StarReviews has a very strong positive correlation with Volume;
- all StarReviews attributes are positively correlated with Volume, even the 1- and 2-star reviews, which are generally considered expressions of negative customer sentiment. This suggests that it is the number of reviews, rather than their quality, that relates to Volume; and
- Attributes relating to physical or financial aspects of the products are largely irrelevant for the purposes of inferential statistics.
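These visual impressions can be verified numerically; a quick sketch, assuming the `plotprod` frame defined above (which already excludes the NA-laden BestSellersRank column):

```r
# Correlation of every attribute with Volume, strongest first
volume_cor <- cor(plotprod)[, "Volume"]
sort(abs(volume_cor), decreasing = TRUE)
```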
Data preprocessing
```r
# Check for duplicated rows, ignoring the identifier and price columns
sum(duplicated(products[, -which(names(products) %in% c("ProductNum", "Price"))]))
```

```
[1] 6
```

```r
# duplicated() flags 6 rows, but manual inspection shows 7 Extended Warranty
# rows in total (rows 35 to 41; the first occurrence is not flagged), so we
# remove them all and summarize what remains.
products <- products[-c(35:41), ]
summary(products)
```

```
 ProductType        ProductNum    Price             x5StarReviews   
 Length:73          Min.   :101   Min.   :   3.60   Min.   :   0.0  
 Class :character   1st Qu.:119   1st Qu.:  49.99   1st Qu.:   8.0  
 Mode  :character   Median :144   Median : 129.00   Median :  29.0  
                    Mean   :143   Mean   : 253.06   Mean   : 163.6  
                    3rd Qu.:162   3rd Qu.: 379.99   3rd Qu.: 164.0  
                    Max.   :200   Max.   :2249.99   Max.   :2801.0  
 x4StarReviews    x3StarReviews    x2StarReviews    x1StarReviews    
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :   0.00  
 1st Qu.:  2.00   1st Qu.:  2.00   1st Qu.:  1.00   1st Qu.:   2.00  
 Median : 17.00   Median :  6.00   Median :  3.00   Median :   6.00  
 Mean   : 41.47   Mean   : 15.44   Mean   : 14.82   Mean   :  39.85  
 3rd Qu.: 37.00   3rd Qu.: 12.00   3rd Qu.:  8.00   3rd Qu.:  16.00  
 Max.   :431.00   Max.   :162.00   Max.   :370.00   Max.   :1654.00  
 PositiveServiceReview NegativeServiceReview Recommendproduct
 Min.   :  0.00        Min.   :  0.000       Min.   :0.1000  
 1st Qu.:  2.00        1st Qu.:  1.000       1st Qu.:0.7000  
 Median :  5.00        Median :  2.000       Median :0.8000  
 Mean   : 29.86        Mean   :  6.055       Mean   :0.7301  
 3rd Qu.: 13.00        3rd Qu.:  4.000       3rd Qu.:0.9000  
 Max.   :536.00        Max.   :112.000       Max.   :1.0000  
 BestSellersRank    ShippingWeight  ProductDepth     ProductWidth   
 Min.   :    1.00   Min.   : 0.01   Min.   :  0.00   Min.   : 0.000  
 1st Qu.:   10.25   1st Qu.: 0.90   1st Qu.:  5.90   1st Qu.: 3.000  
 Median :   49.00   Median : 2.40   Median :  8.80   Median : 7.600  
 Mean   : 1261.40   Mean   :10.58   Mean   : 15.81   Mean   : 8.569  
 3rd Qu.:  386.50   3rd Qu.:13.00   3rd Qu.: 15.50   3rd Qu.:11.700  
 Max.   :17502.00   Max.   :63.00   Max.   :300.00   Max.   :31.750  
 NA's   :15                                                          
 ProductHeight    ProfitMargin    Volume         
 Min.   : 0.000   Min.   :0.050   Min.   :    0.0  
 1st Qu.: 0.600   1st Qu.:0.050   1st Qu.:   32.0  
 Median : 5.200   Median :0.110   Median :  116.0  
 Mean   : 6.859   Mean   :0.131   Mean   :  654.5  
 3rd Qu.:11.500   3rd Qu.:0.170   3rd Qu.:  656.0  
 Max.   :25.800   Max.   :0.400   Max.   :11204.0  
```
```r
# The 12th column, BestSellersRank, has 15 missing values; as we cannot
# reliably impute them, we drop the attribute.
products <- products[, -which(names(products) %in% "BestSellersRank")]
# Check for outliers in the target variable
boxplot(products$Volume)
```
The summary above shows missing values in the “BestSellersRank” column. As we cannot predict or recover the values of this column from the other features, we remove the attribute entirely. We also remove the two most extreme outliers, the observations with volumes of 11204 and 7036.
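The chunk that drops these two outliers and runs the ANOVA mentioned in the Results section is not shown; a minimal sketch of both steps (the `aov` call is reconstructed to match the output below):

```r
# Drop the two extreme Volume outliers (11204 and 7036)
products <- products[!products$Volume %in% c(11204, 7036), ]

# One-way ANOVA: does ProductType explain any variance in Volume?
anova_producttype <- aov(Volume ~ ProductType, data = products)
summary(anova_producttype)
```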
```
            Df   Sum Sq Mean Sq F value Pr(>F)
ProductType 11  5075621  461420   1.473  0.166
Residuals   59 18487579  313349               
```
The p-value (0.166) is well above any conventional significance level, so the categorical variable ProductType has no significant relationship with the dependent variable (Volume). We therefore leave ProductType out of the model, but store it as a vector for later use (plots).

Feature Selection
```r
# Store ProductType for later use in plots, then drop it from the modeling data
ProductType <- as.vector(products$ProductType)
products <- products[, -which(colnames(products) %in% "ProductType")]
# Correlation matrix
corr_products <- cor(products)
# Collinearity: attributes that correlate above 0.8 with another attribute
colinear <- findCorrelation(x = corr_products, cutoff = 0.8, names = T)
colinear
```

```
[1] "x5StarReviews"         "Volume"                "x3StarReviews"        
[4] "x2StarReviews"         "NegativeServiceReview"
```
```r
# Compute the correlation for every pair of variables in a data frame and
# return the pairs sorted by absolute correlation. The result is also
# assigned to the global environment via <<- for convenience.
pairwiseCor <- function(dataframe) {
    pairs <- combn(names(dataframe), 2, simplify = FALSE)
    df <- data.frame(Variable1 = rep(0, length(pairs)), Variable2 = rep(0, length(pairs)),
        AbsCor = rep(0, length(pairs)), Cor = rep(0, length(pairs)))
    for (i in 1:length(pairs)) {
        df[i, 1] <- pairs[[i]][1]
        df[i, 2] <- pairs[[i]][2]
        df[i, 3] <- round(abs(cor(dataframe[, pairs[[i]][1]], dataframe[, pairs[[i]][2]])), 4)
        df[i, 4] <- round(cor(dataframe[, pairs[[i]][1]], dataframe[, pairs[[i]][2]]), 4)
    }
    pairwiseCorDF <- df[order(df$AbsCor, decreasing = TRUE), ]
    row.names(pairwiseCorDF) <- 1:length(pairs)
    pairwiseCorDF <<- pairwiseCorDF
    pairwiseCorDF
}
```
```r
# x5StarReviews correlates perfectly with Volume, so we remove it. There is
# also collinearity between x4StarReviews and x3StarReviews, so we remove
# x3StarReviews; likewise we remove x1StarReviews, which is collinear with
# x2StarReviews.
pairw <- pairwiseCor(products)
kable(pairw[which(pairw$Variable2 == "Volume"), ])
```

| | Variable1 | Variable2 | AbsCor | Cor |
|---|---|---|---|---|
| 1 | x5StarReviews | Volume | 1.0000 | 1.0000 |
| 8 | x4StarReviews | Volume | 0.8041 | 0.8041 |
| 12 | x3StarReviews | Volume | 0.6864 | 0.6864 |
| 18 | PositiveServiceReview | Volume | 0.5658 | 0.5658 |
| 22 | x2StarReviews | Volume | 0.5159 | 0.5159 |
| 25 | NegativeServiceReview | Volume | 0.4997 | 0.4997 |
| 28 | x1StarReviews | Volume | 0.4124 | 0.4124 |
| 34 | ProductDepth | Volume | 0.3203 | 0.3203 |
| 38 | ShippingWeight | Volume | 0.2690 | -0.2690 |
| 45 | Recommendproduct | Volume | 0.1834 | 0.1834 |
| 49 | Price | Volume | 0.1742 | -0.1742 |
| 51 | ProfitMargin | Volume | 0.1681 | -0.1681 |
| 69 | ProductNum | Volume | 0.0989 | 0.0989 |
| 78 | ProductWidth | Volume | 0.0900 | -0.0900 |
| 97 | ProductHeight | Volume | 0.0421 | -0.0421 |
Here we can observe that the reviews are highly correlated with each other. To sum up, we remove x3StarReviews and x1StarReviews, as they are highly correlated with x4StarReviews and x2StarReviews, respectively. Furthermore, we plot a decision tree to see which variables have the biggest impact.
```r
# Conditional inference tree on the remaining predictors (depth limited to 3)
decisiontree <- ctree(Volume ~ ., data = products[, -which(colnames(products) %in%
    c("x5StarReviews", "x3StarReviews", "x1StarReviews"))], controls = ctree_control(maxdepth = 3))
plot(decisiontree)
```
The tree confirms that the variables with the biggest impact are x4StarReviews and PositiveServiceReview, which are also the ones most strongly correlated with Volume.
Next we create a new variable from the star reviews: we fit a linear regression of Volume on the four star-review counts and construct an “average weighted star review” by weighting each review count with its respective regression coefficient.
```r
lm_model <- train(Volume ~ x4StarReviews + x3StarReviews + x2StarReviews + x1StarReviews,
    products, method = "lm")
summary(lm_model)$coefficients
```

```
                 Estimate Std. Error     t value     Pr(>|t|)
(Intercept)    92.8996017  50.904372  1.82498276 7.252930e-02
x4StarReviews  14.2183632   2.572673  5.52668819 5.977552e-07
x3StarReviews -14.7529735   9.979575 -1.47831683 1.440798e-01
x2StarReviews   3.2275066  12.659522  0.25494696 7.995568e-01
x1StarReviews   0.0484845   2.178431  0.02225662 9.823104e-01
```
```r
# Weight each review count by its coefficient; coefficients[2] to [5] are the
# estimates for x4, x3, x2 and x1StarReviews respectively
products$Avg_WghtStar <- summary(lm_model)$coefficients[2] * products$x4StarReviews +
    summary(lm_model)$coefficients[3] * products$x3StarReviews +
    summary(lm_model)$coefficients[4] * products$x2StarReviews +
    summary(lm_model)$coefficients[5] * products$x1StarReviews
```

| | Variable1 | Variable2 | AbsCor | Cor |
|---|---|---|---|---|
| 10 | Volume | Avg_WghtStar | 0.8163 | 0.8163 |
| | Variable1 | Variable2 | AbsCor | Cor |
|---|---|---|---|---|
| 12 | x4StarReviews | Volume | 0.8041 | 0.8041 |
Here we can observe that the new variable correlates with Volume (0.8163) more strongly than x4StarReviews (0.8041), previously the strongest predictor.
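The calls that produced the two comparison tables above are not shown; a plausible reconstruction using the pairwiseCor helper defined earlier (the filters are assumptions based on the rows displayed):

```r
# Recompute pairwise correlations now that Avg_WghtStar exists
pairw2 <- pairwiseCor(products)
# Correlation of the engineered variable with Volume
kable(pairw2[which(pairw2$Variable1 == "Volume" & pairw2$Variable2 == "Avg_WghtStar"), ])
# For comparison: the strongest raw predictor
kable(pairw2[which(pairw2$Variable1 == "x4StarReviews" & pairw2$Variable2 == "Volume"), ])
```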
```r
# Decision tree using the engineered variable instead of the raw star reviews
avg_decisiontree <- ctree(Volume ~ ., data = products[, -which(colnames(products) %in%
    c("x5StarReviews", "x4StarReviews", "x3StarReviews", "x2StarReviews", "x1StarReviews"))],
    controls = ctree_control(maxdepth = 3))
plot(avg_decisiontree)
```
Modeling
We create a loop that trains several model types on two candidate feature sets and evaluates each combination on the test set.
```r
# Data partition: 75/25 train/test split
set.seed(69)
indexing <- createDataPartition(products$Volume, p = 0.75, list = F)
trainSet <- products[indexing, ]
testSet <- products[-indexing, ]
# Two candidate feature sets and five model types
form <- c("Volume ~ x4StarReviews + PositiveServiceReview",
          "Volume ~ Avg_WghtStar + PositiveServiceReview")
models <- c("rf", "knn", "svmLinear", "svmRadial", "glm")
combined <- c()
cnames <- vector()
for (i in form) {
    for (j in models) {
        model <- train(formula(i), data = trainSet, method = j, tuneLength = 3,
            metric = "MAE")
        predictions <- predict(model, testSet)
        results <- postResample(predictions, testSet$Volume)
        combined <- cbind(results, combined)
        cnames <- c(paste(i, j), cnames)
    }
}
```

```
note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
[1] 123.048
```
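The minimum test-set MAE printed above (123.05, the random forest on the Avg_WghtStar feature set) and the comparison table below were presumably produced from the loop's results; a minimal sketch, assuming the `combined` matrix and `cnames` vector built above:

```r
# Label the result columns, inspect the best (lowest) MAE, and tabulate
colnames(combined) <- cnames
min(combined["MAE", ])
kable(combined)
```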
| | Volume ~ Avg_WghtStar + PositiveServiceReview glm | Volume ~ Avg_WghtStar + PositiveServiceReview svmRadial | Volume ~ Avg_WghtStar + PositiveServiceReview svmLinear | Volume ~ Avg_WghtStar + PositiveServiceReview knn | Volume ~ Avg_WghtStar + PositiveServiceReview rf | Volume ~ x4StarReviews + PositiveServiceReview glm | Volume ~ x4StarReviews + PositiveServiceReview svmRadial | Volume ~ x4StarReviews + PositiveServiceReview svmLinear | Volume ~ x4StarReviews + PositiveServiceReview knn | Volume ~ x4StarReviews + PositiveServiceReview rf |
|---|---|---|---|---|---|---|---|---|---|---|
| RMSE | 359.7378551 | 417.5630488 | 349.2470074 | 398.5383294 | 218.6926097 | 395.2775566 | 432.1760144 | 377.3323953 | 249.506351 | 219.9479104 |
| Rsquared | 0.7569293 | 0.7671428 | 0.7995331 | 0.7991097 | 0.9215047 | 0.6851846 | 0.7425961 | 0.7309845 | 0.939719 | 0.9186288 |
| MAE | 197.1320486 | 248.5718783 | 170.4095309 | 227.7000000 | 123.0479857 | 233.7248491 | 259.8334941 | 187.7202784 | 134.294048 | 125.8509355 |
```r
# Create the engineered Avg_WghtStar variable for the new products as well
newproducts$Avg_WghtStar <- summary(lm_model)$coefficients[2] * newproducts$x4StarReviews +
    summary(lm_model)$coefficients[3] * newproducts$x3StarReviews +
    summary(lm_model)$coefficients[4] * newproducts$x2StarReviews +
    summary(lm_model)$coefficients[5] * newproducts$x1StarReviews
# Train the final model; "rf" is caret's default method, made explicit here
rf_model <- train(Volume ~ Avg_WghtStar + PositiveServiceReview, trainSet,
    method = "rf", tuneLength = 3, metric = "MAE")
```

```
note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
```
```r
# Replace zero volumes with 1 so the relative error below is defined
products$Volume[products$Volume == 0] <- 1
Volume <- as.numeric(products$Volume)
ex_preds <- as.numeric(predict(rf_model, products))
ae_errors <- as.numeric(abs(ex_preds - products$Volume))  # absolute errors
re_errors <- as.numeric(ae_errors/products$Volume)        # relative errors
errors_df <- as.data.frame(cbind(Volume, ex_preds, ae_errors, re_errors))
errors_df$ProductType <- ProductType
errors_df$ProductNum <- products$ProductNum
ggplot(errors_df, aes(x = Volume, y = re_errors, color = ProductType)) + geom_jitter() +
    ylab("Relative error") + ggtitle("Relative error vs Volume")
ggplot(errors_df, aes(x = Volume, y = ae_errors, color = ProductType)) + geom_jitter() +
    ylab("Absolute error") + ggtitle("Absolute error vs Volume")
ccols <- c(volume = "#f04546", pred = "#3591d1")
ggplot(errors_df, aes()) + geom_jitter(aes(x = ProductNum, y = ex_preds, color = "pred")) +
    geom_jitter(aes(x = ProductNum, y = Volume, color = "volume")) +
    geom_smooth(aes(x = ProductNum, y = ex_preds, color = "pred"), method = "loess", se = F) +
    geom_smooth(aes(x = ProductNum, y = Volume, color = "volume"), method = "loess", se = F) +
    scale_color_manual(values = ccols) +  # ccols was defined but never applied; used here
    ylab("Volume")
```
As the plots show, the relative error is largest at low volumes, which is nearly always the case, since the relative error is the absolute error expressed as a fraction of the observed value. The absolute error, conversely, is larger at high volumes.
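One step is not shown above: the model's predictions must be written into `newproducts` before the final table can be built from its `Volume` column. A minimal sketch:

```r
# Predict sales volumes for the candidate products with the final model
newproducts$Volume <- predict(rf_model, newproducts)
```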
```r
# Keep only the four product types of interest and the reporting columns
filtered <- newproducts[which(newproducts$ProductType == "PC" | newproducts$ProductType ==
    "Laptop" | newproducts$ProductType == "Netbook" | newproducts$ProductType ==
    "Smartphone"), ]
finalresults <- filtered[, which(colnames(filtered) %in% c("ProductType", "ProductNum",
    "Volume"))]
# Round predicted volumes to whole units
finalresults$Volume <- as.integer(round(finalresults$Volume, 0))
kable(finalresults, caption = "Volume of the new products")
```

| ProductType | ProductNum | Volume |
|---|---|---|
| PC | 171 | 217 |
| PC | 172 | 104 |
| Laptop | 173 | 177 |
| Laptop | 175 | 54 |
| Laptop | 176 | 19 |
| Netbook | 178 | 62 |
| Netbook | 180 | 1191 |
| Netbook | 181 | 159 |
| Netbook | 183 | 43 |
| Smartphone | 193 | 261 |
| Smartphone | 194 | 899 |
| Smartphone | 195 | 84 |
| Smartphone | 196 | 95 |