ContextBase Logo



Table of Contents

Section 1 - The Problem
Section 2 - The Solution

Section 3 - Data Import
Section 4 - Exploratory Data Analysis

Section 5 - Data Transformation
Section 6 - Feature Detection
Section 7 - Outlier Removal

Section 8 - Modeling
Section 9 - Predicting Sales on New Products
Section 10 - Impact of Customer and Service Reviews on Sales Volume

Section 11 - Conclusions

Section 12 - Appendix
Section 12.1 - Required Packages
Section 12.2 - Session Information



Section 1 - The Problem

Sales Teams need to better understand how new types of Products might impact Sales across the Enterprise, and how specific new Product Types perform against each other. Therefore, prediction of sales volume for new Product Types is required, even though the only data available for analysis is historical sales data.



Section 2 - The Solution

The solution this document proposes is to use data of historical sales and analyze the impact of services reviews and customer reviews on historical products to predict sales volume of new products. Exploratory Data Analysis, Data Transformation, Feature Detection, Outlier Removal, and Predictive Modeling are applied, and the results presented in tabular and graphic forms. Four different product types: PC, Laptops, Netbooks and Smartphones are assessed.



Section 3 - Data Import

Existing products and new products datasets imported are in the .csv format. These datasets contain product type, number, price, reviews, recommendation, rank, weight, dimension, profit margin, and volume data for a range of electronic products sold by the client.

existing_products <- read.csv("Assignment 1/Assignment 1 Data/existingproductattributes2017.csv")

new_products <- read.csv("Assignment 1/Assignment 1 Data/newproductattributes2017.csv")



Section 4 - Exploratory Data Analysis

The Data Science analysis of the imported datasets begins with exploration of the categories of observations, and summary statitics of the datasets.

kable(existing_products[1:5,], caption = "Table 1: Sample of Existing Products Records")
Table 1: Sample of Existing Products Records
ProductType ProductNum Price x5StarReviews x4StarReviews x3StarReviews x2StarReviews x1StarReviews PositiveServiceReview NegativeServiceReview Recommendproduct BestSellersRank ShippingWeight ProductDepth ProductWidth ProductHeight ProfitMargin Volume
PC 101 949.00 3 3 2 0 0 2 0 0.9 1967 25.8 23.94 6.62 16.89 0.15 12
PC 102 2249.99 2 1 0 0 0 1 0 0.9 4806 50.0 35.00 31.75 19.00 0.25 8
PC 103 399.00 3 0 0 0 0 1 0 0.9 12076 17.4 10.50 8.30 10.20 0.08 12
Laptop 104 409.99 49 19 8 3 9 7 8 0.8 109 5.7 15.00 9.90 1.30 0.08 196
Laptop 105 1079.99 58 31 11 7 36 7 20 0.7 268 7.0 12.90 0.30 8.90 0.09 232
kable(new_products[1:5,], caption = "Table 2: Sample of New Products Records")
Table 2: Sample of New Products Records
ProductType ProductNum Price x5StarReviews x4StarReviews x3StarReviews x2StarReviews x1StarReviews PositiveServiceReview NegativeServiceReview Recommendproduct BestSellersRank ShippingWeight ProductDepth ProductWidth ProductHeight ProfitMargin Volume
PC 171 699 96 26 14 14 25 12 3 0.7 2498 19.9 20.63 19.25 8.39 0.25 0
PC 172 860 51 11 10 10 21 7 5 0.6 490 27.0 21.89 27.01 9.13 0.20 0
Laptop 173 1199 74 10 3 3 11 11 5 0.8 111 6.6 8.94 12.80 0.68 0.10 0
Laptop 175 1199 7 2 1 1 1 2 1 0.6 4446 13.0 16.30 10.80 1.40 0.15 0
Laptop 176 1999 1 1 1 3 0 0 1 0.3 2820 11.6 16.81 10.90 0.88 0.23 0
# Summary
kable(summary(existing_products), caption = "Table 3: Summary Statistics of Existing Products")
Table 3: Summary Statistics of Existing Products
ProductType ProductNum Price x5StarReviews x4StarReviews x3StarReviews x2StarReviews x1StarReviews PositiveServiceReview NegativeServiceReview Recommendproduct BestSellersRank ShippingWeight ProductDepth ProductWidth ProductHeight ProfitMargin Volume
Accessories :26 Min. :101.0 Min. : 3.60 Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. :0.100 Min. : 1 Min. : 0.0100 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.0500 Min. : 0
Printer :12 1st Qu.:120.8 1st Qu.: 52.66 1st Qu.: 10.0 1st Qu.: 2.75 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.: 2.00 1st Qu.: 2.00 1st Qu.: 1.000 1st Qu.:0.700 1st Qu.: 7 1st Qu.: 0.5125 1st Qu.: 4.775 1st Qu.: 1.750 1st Qu.: 0.400 1st Qu.:0.0500 1st Qu.: 40
ExtendedWarranty:10 Median :140.5 Median : 132.72 Median : 50.0 Median : 22.00 Median : 7.00 Median : 3.00 Median : 8.50 Median : 5.50 Median : 3.000 Median :0.800 Median : 27 Median : 2.1000 Median : 7.950 Median : 6.800 Median : 3.950 Median :0.1200 Median : 200
Software : 6 Mean :142.6 Mean : 247.25 Mean : 176.2 Mean : 40.20 Mean : 14.79 Mean : 13.79 Mean : 37.67 Mean : 51.75 Mean : 6.225 Mean :0.745 Mean : 1126 Mean : 9.6681 Mean : 14.425 Mean : 7.819 Mean : 6.259 Mean :0.1545 Mean : 705
Display : 5 3rd Qu.:160.2 3rd Qu.: 352.49 3rd Qu.: 306.5 3rd Qu.: 33.00 3rd Qu.: 11.25 3rd Qu.: 7.00 3rd Qu.: 15.25 3rd Qu.: 42.00 3rd Qu.: 6.250 3rd Qu.:0.900 3rd Qu.: 281 3rd Qu.:11.2050 3rd Qu.: 15.025 3rd Qu.:11.275 3rd Qu.:10.300 3rd Qu.:0.2000 3rd Qu.: 1226
PC : 4 Max. :200.0 Max. :2249.99 Max. :2801.0 Max. :431.00 Max. :162.00 Max. :370.00 Max. :1654.00 Max. :536.00 Max. :112.000 Max. :1.000 Max. :17502 Max. :63.0000 Max. :300.000 Max. :31.750 Max. :25.800 Max. :0.4000 Max. :11204
(Other) :17 NA NA NA NA NA NA NA NA NA NA NA’s :15 NA NA NA NA NA NA
kable(summary(new_products), caption = "Table 4: Summary Statistics of New Products")
Table 4: Summary Statistics of New Products
ProductType ProductNum Price x5StarReviews x4StarReviews x3StarReviews x2StarReviews x1StarReviews PositiveServiceReview NegativeServiceReview Recommendproduct BestSellersRank ShippingWeight ProductDepth ProductWidth ProductHeight ProfitMargin Volume
Netbook :4 Min. :171.0 Min. : 8.5 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. :0.3000 Min. : 1.00 Min. : 0.200 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.0500 Min. :0
Smartphone :4 1st Qu.:179.5 1st Qu.: 130.0 1st Qu.: 16.00 1st Qu.: 2.00 1st Qu.: 1.75 1st Qu.: 1.00 1st Qu.: 1.75 1st Qu.: 2.00 1st Qu.: 1.000 1st Qu.:0.6000 1st Qu.: 93.25 1st Qu.: 0.900 1st Qu.: 5.225 1st Qu.: 5.832 1st Qu.: 0.400 1st Qu.:0.0975 1st Qu.:0
Laptop :3 Median :193.5 Median : 275.0 Median : 46.00 Median : 10.50 Median : 4.50 Median : 4.00 Median : 13.00 Median : 5.00 Median : 3.500 Median :0.7000 Median : 750.50 Median : 4.450 Median : 8.000 Median : 9.950 Median : 0.985 Median :0.1150 Median :0
Accessories:2 Mean :219.5 Mean : 425.6 Mean : 178.50 Mean : 48.04 Mean : 21.92 Mean : 17.50 Mean : 27.58 Mean :13.46 Mean : 5.667 Mean :0.6708 Mean : 3957.62 Mean : 7.802 Mean : 9.094 Mean :10.408 Mean : 3.541 Mean :0.1817 Mean :0
GameConsole:2 3rd Qu.:301.2 3rd Qu.: 486.5 3rd Qu.: 99.25 3rd Qu.: 26.00 3rd Qu.: 16.75 3rd Qu.: 20.25 3rd Qu.: 35.25 3rd Qu.:12.50 3rd Qu.: 7.500 3rd Qu.:0.8000 3rd Qu.: 3150.00 3rd Qu.: 9.575 3rd Qu.:11.425 3rd Qu.:12.875 3rd Qu.: 2.888 3rd Qu.:0.2000 3rd Qu.:0
PC :2 Max. :307.0 Max. :1999.0 Max. :1525.00 Max. :437.00 Max. :224.00 Max. :160.00 Max. :247.00 Max. :90.00 Max. :23.000 Max. :1.0000 Max. :44465.00 Max. :42.000 Max. :21.890 Max. :27.010 Max. :25.800 Max. :0.9000 Max. :0
(Other) :7 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# checking product types
kable(unique(existing_products$ProductType), caption = "Table 5: Existing Product Types")
Table 5: Existing Product Types
x
PC
Laptop
Accessories
Software
Display
Printer
PrinterSupplies
ExtendedWarranty
Netbook
Tablet
Smartphone
GameConsole
kable(unique(new_products$ProductType), caption = "Table 6: New Product Types")
Table 6: New Product Types
x
PC
Laptop
Netbook
Tablet
Smartphone
GameConsole
Display
Accessories
Software
Printer
PrinterSupplies
ExtendedWarranty



Section 5 - Data Transformation

In this section, the “Extended Guarantee” category is de-duplicated, and “not available” data is found. All 15 NAs are found in the “BestSellerRank” dataset column. This represent 20% of the total observations. For this reason the attribute was removed from the analysis.

New data categories are created in order to accomplish “full rank parameterization” that is required for accurate correlation and predictive analytics. Correlations between attributes are then found, and displayed via correlation plotting.

# De-duplicate "Extended Guarantee" category
existing_products2 <- existing_products[c(34:41),]
existing_products2 <- aggregate(.~ProductType, data=existing_products2, FUN=mean)
existing_products <- rbind(existing_products[-c(34:41),],existing_products2)
rm(existing_products2)

# Finds missing attrbiutes
sapply(existing_products, function(x) {sum(is.na(x))})  
##           ProductType            ProductNum                 Price 
##                     0                     0                     0 
##         x5StarReviews         x4StarReviews         x3StarReviews 
##                     0                     0                     0 
##         x2StarReviews         x1StarReviews PositiveServiceReview 
##                     0                     0                     0 
## NegativeServiceReview      Recommendproduct       BestSellersRank 
##                     0                     0                    15 
##        ShippingWeight          ProductDepth          ProductWidth 
##                     0                     0                     0 
##         ProductHeight          ProfitMargin                Volume 
##                     0                     0                     0
# The BestSellerRank attribute is removed from the analysis
existing_products$BestSellersRank <- NULL

# Creates new data categories for "full rank parameterization" required for accurate correlation and predictive analytics.
newDataframe <- dummyVars(" ~ .", data = existing_products)
readyData <- data.frame(predict(newDataframe, newdata = existing_products))
      
# Finding Correlation between attributes
corrData <- cor(readyData)

# Displays a correlation plot
print(paste("Graph 1: Correlation Plot"))
## [1] "Graph 1: Correlation Plot"
corrplot(corrData)



Section 6 - Feature Detection

The features related to Service Reviews and Product Reviews were detected as the most correlated to predict Volume. To avoid overfitting the predictive analytics, the following thresholds were applied:
1) Independert variables with correlation > 0.95 with the dependent variable were removed.
2) Independent variables with strong correlation between themselves (> 0.85) were removed, keeping only the variables with strongest correlation with the volume.

# High correlation independent variables are selected.
keeps <- c("Volume", "x4StarReviews", "PositiveServiceReview")

# Low correlation independent variables are removed.
drops <- c("x5StarReviews", "x3StarReviews", "x2StarReviews", "x1StarReviews", "NegativeServiceReview")

# Bar Plot of Existing Products, Reviews x Volume
ggplot(existing_products, aes(x=x4StarReviews, y=Volume)) +
  geom_boxplot() +
  ggtitle("Graph 2: Bar Plot of Existing Products, Reviews x Volume")

# Plot of Existing Product Types, Reviews x Volume
ggplot(data = existing_products) +
  geom_smooth(mapping = aes(x=x4StarReviews, y=Volume)) +
  geom_point(mapping = aes(x=x4StarReviews, y=Volume,
                           color=ProductType)) +
  ggtitle("Graph 3: Plot of Existing Product Types, Reviews x Volume")



Section 7 - Outlier Removal

In this section of data pre-processing, the outliers with volume greater than 6000 are removed from the existing products dataset. Then, in order to specify the remaining outlier removal a list of product types for examination is created. 4 star reviews less than 100 and products with a volume of greater than 100 are removed, with the exception of the product type, “Smartphones”.

# remove outliers with volume > 6000
existing_products <- existing_products[-which(existing_products$Volume > 6000), ]

# create list with important product types
producttypes <- c("PC", "Laptop", "Netbook", "Smartphone")

# remove outliers, except smartphone
existing_products <- existing_products[-which(existing_products$x4StarReviews < 100 & existing_products$Volume > 1000 & as.character(existing_products$ProductType) != producttypes[4]),]



Section 8 - Modeling

Cross-validation was used to validate the prediction method, via the creation of training and test datasets for predictive analytics processing. The predictive analytic methods reviewed are:
1) Linear Modeling
2) Support Vector Machines
3) Random Forest

The accuracy of the predictive analytic methods is assessed with the statistics of “Root Mean Squared (RMSE)”, “RSquared”, and “Mean Absolute Error (MAE)”.

RMSE quantifies the error in the measuring unit of the variable. In such cases we need to compute indexes that average the residuals of the model. Because the residuals are both positive and negative and their distribution is nearly symmetrical, their average will always be zero. Therefore, RMSE averages the squared residuals, or the square root of the mean of the squared residuals.

R-Squared is possibly the most commonly used measure of accuracy, and explains the percentage of variance in the target variable. R-Squared is the ratio of the regression sum of squares and the total sum of squares.
MAE is used to mitigate the inflation of the value of RMSE by large residuals. MAE is usually presented with RMSE as an alternate statistics to gain a greater understanding of the accuracy of the prediction model.

# Creates train and test sets for Cross Validation
CrossValidation <- trainControl(method = "repeatedcv", number = 4, repeats = 100)

# Linear Model Predictive Analytics
set.seed(123)
LMFit <- train(Volume~., data = existing_products[keeps], method = "lm", trControl = CrossValidation, preProc = c("center","scale"))

# Table 7: Review the accuracy of the Linear Model fit
LMFit
## Linear Regression 
## 
## 65 samples
##  2 predictor
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (4 fold, repeated 100 times) 
## Summary of sample sizes: 48, 49, 49, 49, 48, 49, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   244.1621  0.7721457  118.5894
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Support Vector Machines Predictive Analytics
set.seed(123)
SVMFit <- train(Volume~., data = existing_products[keeps], method = "svmLinear2", trControl = CrossValidation, preProc = c("center","scale"))

# Table 8: Review the accuracy of the Support Vector Machines fit
SVMFit
## Support Vector Machines with Linear Kernel 
## 
## 65 samples
##  2 predictor
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (4 fold, repeated 100 times) 
## Summary of sample sizes: 48, 49, 49, 49, 48, 49, ... 
## Resampling results across tuning parameters:
## 
##   cost  RMSE      Rsquared   MAE     
##   0.25  231.0488  0.7741247  107.1101
##   0.50  221.8023  0.7846922  103.7282
##   1.00  216.1003  0.7910762  101.7691
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cost = 1.
# Random Forest Predictive Analytics --- Selected --- 
set.seed(123)
RFFit <- train(Volume~., data = existing_products[keeps], method = "rf", trControl = CrossValidation, preProc = c("center","scale"))
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
# Table 9: Review the accuracy of the Random Forest fit
RFFit
## Random Forest 
## 
## 65 samples
##  2 predictor
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (4 fold, repeated 100 times) 
## Summary of sample sizes: 48, 49, 49, 49, 48, 49, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   199.8766  0.8449612  99.5098
## 
## Tuning parameter 'mtry' was held constant at a value of 2



Section 9 - Predicting Sales on New Products

This section creates the sales predictions for four target product types, (PC, Laptops, Netbooks and Smartphones), in the new product attributes data set, formats a table of results, and saves the prediction results in .csv format.

# Selects product types within the the scope of this analysis (PC, Laptops, Netbooks and Smartphones).
new_products <- new_products[which(new_products$ProductType %in% producttypes),] 

# Predicts sales volume
Predictions <- predict(RFFit, newdata=new_products[keeps])

# Places the predictions in the "New Products" volume field
new_products$Volume <- round(Predictions)

# Calculate Revenue  
new_products$Revenue  <- round((new_products$Price * new_products$ProfitMargin * new_products$Volume), digits = 2)

# Format a table of sales predictions
SalesPredictionTable <- data.frame(new_products[,1], new_products$Volume, new_products$Revenue)
names(SalesPredictionTable) <- c("New Products", "Volume Predictions", "Revenue Predictions" )

# Print the table of sales predictions
kable(SalesPredictionTable, caption = "Table 10: Sales Forecasts for New Products")
Table 10: Sales Forecasts for New Products
New Products Volume Predictions Revenue Predictions
PC 401 70074.75
PC 130 22360.00
Laptop 201 24099.90
Laptop 52 9352.20
Laptop 6 2758.62
Netbook 51 1631.96
Netbook 1278 37841.58
Netbook 115 5553.35
Netbook 30 891.00
Smartphone 295 6457.55
Smartphone 651 3827.88
Smartphone 82 1832.70
Smartphone 140 4620.00
# Save Predictions in CSV
write.csv(new_products, "new_products_Prediction.csv")



Section 10 - Impact of Customer and Service Reviews on Sales Volume

This section contains visualizations that allow for the assessment of the impact that Services Reviews, and Customer Reviews, have on sales of the entire range of product types.

The first plot is a box plot of the effect of reviews on sales volume. The second plot displays the effect of reviews on sales volume by product type.



Section 11 - Conclusions

The algorithms used for this document include functions for Exploratory Data Analysis that summarize the statistics within the datasets, functions to improve the later predictive analytics via filling in dataset information, functions to find data category correlations, functions for data visualization, and functions to subset the data for outlier removal.

The method of Predictive Analytics used for this document involves Cross-Validation for Machine Learning of trends in the dataset. After the dataset is divided into training and testing datasets, the predicive algorithms used for this document are linear modeling, support vector machines, and random forest, (multiple decision trees).

The Random Forest algorithm was selected with the rationale that the 80% confidence level of the Random Forest algorithm would provide more accurate predictions than the 78.8% confidence level of the Support Vector Machines algorithm, or the 77.2% confidence level of the Linear Modeling algorithm.

Sales predictions were then created for the four target product types found in the new product data set. The results were formatted into a dataframe of predictions.

Lastly, the impact of customer and service reviews on sales volume are displayed visually with the ggplot2 graphics package.



Section 12 - Appendix

Section 12.1 - Required Packages

Table 11: List of Required Packages
List of Required Packages
Required Packages ‘readr’ ‘lattice’ ‘caret’ ‘datasets’ ‘reshape2’ ‘base’ ‘stats’ ‘e1071’ ‘magrittr’ ‘corrplot’ ‘randomForest’ ‘gbm’ ‘ggplot2’ ‘knitr’



Section 12.2 - Session Information

Table 12: Session Information
Session Information
R Version R version 3.6.2 (2019-12-12)
Platform x86_64-w64-mingw32/x64 (64-bit)
Running Windows 10 x64 (build 17763)
RStudio Citation RStudio: Integrated Development Environment for R
RStudio Version 1.0.153