Sales_Prediction_Report.utf8

Section 1 - The Problem
Section 2 - The Solution

Section 3 - Data Import
Section 4 - Exploratory Data Analysis

Section 5 - Data Transformation
Section 6 - Feature Detection
Section 7 - Outlier Removal

Section 8 - Modeling
Section 9 - Predicting Sales on New Products
Section 10 - Impact of Customer and Service Reviews on Sales Volume

Section 11 - Conclusions

Section 12 - Appendix
Section 12.1 - Required Packages
Section 12.2 - Session Information

Section 1 - The Problem

Sales Teams need to better understand how new types of Products might impact Sales across the Enterprise, and how specific new Product Types perform against each other. Therefore, prediction of sales volume for new Product Types is required, even though the only data available for analysis is historical sales data.

Section 2 - The Solution

The solution this document proposes is to use data of historical sales and analyze the impact of services reviews and customer reviews on historical products to predict sales volume of new products. Exploratory Data Analysis, Data Transformation, Feature Detection, Outlier Removal, and Predictive Modeling are applied, and the results presented in tabular and graphic forms. Four different product types: PC, Laptops, Netbooks and Smartphones are assessed.

Section 3 - Data Import

Existing products and new products datasets imported are in the .csv format. These datasets contain product type, number, price, reviews, recommendation, rank, weight, dimension, profit margin, and volume data for a range of electronic products sold by the client.

existing_products <- read.csv("Assignment 1/Assignment 1 Data/existingproductattributes2017.csv")

new_products <- read.csv("Assignment 1/Assignment 1 Data/newproductattributes2017.csv")

Section 4 - Exploratory Data Analysis

The Data Science analysis of the imported datasets begins with exploration of the categories of observations, and summary statitics of the datasets.

kable(existing_products[1:5,], caption = "Table 1: Sample of Existing Products Records")

Table 1: Sample of Existing Products Records
ProductType	ProductNum	Price	x5StarReviews	x4StarReviews	x3StarReviews	x2StarReviews	x1StarReviews	PositiveServiceReview	NegativeServiceReview	Recommendproduct	BestSellersRank	ShippingWeight	ProductDepth	ProductWidth	ProductHeight	ProfitMargin	Volume
PC	101	949.00	3	3	2	0	0	2	0	0.9	1967	25.8	23.94	6.62	16.89	0.15	12
PC	102	2249.99	2	1	0	0	0	1	0	0.9	4806	50.0	35.00	31.75	19.00	0.25	8
PC	103	399.00	3	0	0	0	0	1	0	0.9	12076	17.4	10.50	8.30	10.20	0.08	12
Laptop	104	409.99	49	19	8	3	9	7	8	0.8	109	5.7	15.00	9.90	1.30	0.08	196
Laptop	105	1079.99	58	31	11	7	36	7	20	0.7	268	7.0	12.90	0.30	8.90	0.09	232

kable(new_products[1:5,], caption = "Table 2: Sample of New Products Records")

Table 2: Sample of New Products Records
ProductType	ProductNum	Price	x5StarReviews	x4StarReviews	x3StarReviews	x2StarReviews	x1StarReviews	PositiveServiceReview	NegativeServiceReview	Recommendproduct	BestSellersRank	ShippingWeight	ProductDepth	ProductWidth	ProductHeight	ProfitMargin
PC	171	699	96	26	14	14	25	12	3	0.7	2498	19.9	20.63	19.25	8.39	0.25
PC	172	860	51	11	10	10	21	7	5	0.6	490	27.0	21.89	27.01	9.13	0.20
Laptop	173	1199	74	10	3	3	11	11	5	0.8	111	6.6	8.94	12.80	0.68	0.10
Laptop	175	1199	7	2	1	1	1	2	1	0.6	4446	13.0	16.30	10.80	1.40	0.15
Laptop	176	1999	1	1	1	3	0	0	1	0.3	2820	11.6	16.81	10.90	0.88	0.23

# Summary
kable(summary(existing_products), caption = "Table 3: Summary Statistics of Existing Products")

Table 3: Summary Statistics of Existing Products
ProductType	ProductNum	Price	x5StarReviews	x4StarReviews	x3StarReviews	x2StarReviews	x1StarReviews	PositiveServiceReview	NegativeServiceReview	Recommendproduct	BestSellersRank	ShippingWeight	ProductDepth	ProductWidth	ProductHeight	ProfitMargin	Volume
Accessories :26	Min. :101.0	Min. : 3.60	Min. : 0.0	Min. : 0.00	Min. : 0.00	Min. : 0.00	Min. : 0.00	Min. : 0.00	Min. : 0.000	Min. :0.100	Min. : 1	Min. : 0.0100	Min. : 0.000	Min. : 0.000	Min. : 0.000	Min. :0.0500	Min. : 0
Printer :12	1st Qu.:120.8	1st Qu.: 52.66	1st Qu.: 10.0	1st Qu.: 2.75	1st Qu.: 2.00	1st Qu.: 1.00	1st Qu.: 2.00	1st Qu.: 2.00	1st Qu.: 1.000	1st Qu.:0.700	1st Qu.: 7	1st Qu.: 0.5125	1st Qu.: 4.775	1st Qu.: 1.750	1st Qu.: 0.400	1st Qu.:0.0500	1st Qu.: 40
ExtendedWarranty:10	Median :140.5	Median : 132.72	Median : 50.0	Median : 22.00	Median : 7.00	Median : 3.00	Median : 8.50	Median : 5.50	Median : 3.000	Median :0.800	Median : 27	Median : 2.1000	Median : 7.950	Median : 6.800	Median : 3.950	Median :0.1200	Median : 200
Software : 6	Mean :142.6	Mean : 247.25	Mean : 176.2	Mean : 40.20	Mean : 14.79	Mean : 13.79	Mean : 37.67	Mean : 51.75	Mean : 6.225	Mean :0.745	Mean : 1126	Mean : 9.6681	Mean : 14.425	Mean : 7.819	Mean : 6.259	Mean :0.1545	Mean : 705
Display : 5	3rd Qu.:160.2	3rd Qu.: 352.49	3rd Qu.: 306.5	3rd Qu.: 33.00	3rd Qu.: 11.25	3rd Qu.: 7.00	3rd Qu.: 15.25	3rd Qu.: 42.00	3rd Qu.: 6.250	3rd Qu.:0.900	3rd Qu.: 281	3rd Qu.:11.2050	3rd Qu.: 15.025	3rd Qu.:11.275	3rd Qu.:10.300	3rd Qu.:0.2000	3rd Qu.: 1226
PC : 4	Max. :200.0	Max. :2249.99	Max. :2801.0	Max. :431.00	Max. :162.00	Max. :370.00	Max. :1654.00	Max. :536.00	Max. :112.000	Max. :1.000	Max. :17502	Max. :63.0000	Max. :300.000	Max. :31.750	Max. :25.800	Max. :0.4000	Max. :11204
(Other) :17	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA’s :15	NA	NA	NA	NA	NA	NA

kable(summary(new_products), caption = "Table 4: Summary Statistics of New Products")

Table 4: Summary Statistics of New Products
ProductType	ProductNum	Price	x5StarReviews	x4StarReviews	x3StarReviews	x2StarReviews	x1StarReviews	PositiveServiceReview	NegativeServiceReview	Recommendproduct	BestSellersRank	ShippingWeight	ProductDepth	ProductWidth	ProductHeight	ProfitMargin	Volume
Netbook :4	Min. :171.0	Min. : 8.5	Min. : 0.00	Min. : 0.00	Min. : 0.00	Min. : 0.00	Min. : 0.00	Min. : 0.00	Min. : 0.000	Min. :0.3000	Min. : 1.00	Min. : 0.200	Min. : 0.000	Min. : 0.000	Min. : 0.000	Min. :0.0500	Min. :0
Smartphone :4	1st Qu.:179.5	1st Qu.: 130.0	1st Qu.: 16.00	1st Qu.: 2.00	1st Qu.: 1.75	1st Qu.: 1.00	1st Qu.: 1.75	1st Qu.: 2.00	1st Qu.: 1.000	1st Qu.:0.6000	1st Qu.: 93.25	1st Qu.: 0.900	1st Qu.: 5.225	1st Qu.: 5.832	1st Qu.: 0.400	1st Qu.:0.0975	1st Qu.:0
Laptop :3	Median :193.5	Median : 275.0	Median : 46.00	Median : 10.50	Median : 4.50	Median : 4.00	Median : 13.00	Median : 5.00	Median : 3.500	Median :0.7000	Median : 750.50	Median : 4.450	Median : 8.000	Median : 9.950	Median : 0.985	Median :0.1150	Median :0
Accessories:2	Mean :219.5	Mean : 425.6	Mean : 178.50	Mean : 48.04	Mean : 21.92	Mean : 17.50	Mean : 27.58	Mean :13.46	Mean : 5.667	Mean :0.6708	Mean : 3957.62	Mean : 7.802	Mean : 9.094	Mean :10.408	Mean : 3.541	Mean :0.1817	Mean :0
GameConsole:2	3rd Qu.:301.2	3rd Qu.: 486.5	3rd Qu.: 99.25	3rd Qu.: 26.00	3rd Qu.: 16.75	3rd Qu.: 20.25	3rd Qu.: 35.25	3rd Qu.:12.50	3rd Qu.: 7.500	3rd Qu.:0.8000	3rd Qu.: 3150.00	3rd Qu.: 9.575	3rd Qu.:11.425	3rd Qu.:12.875	3rd Qu.: 2.888	3rd Qu.:0.2000	3rd Qu.:0
PC :2	Max. :307.0	Max. :1999.0	Max. :1525.00	Max. :437.00	Max. :224.00	Max. :160.00	Max. :247.00	Max. :90.00	Max. :23.000	Max. :1.0000	Max. :44465.00	Max. :42.000	Max. :21.890	Max. :27.010	Max. :25.800	Max. :0.9000	Max. :0
(Other) :7	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

# checking product types
kable(unique(existing_products$ProductType), caption = "Table 5: Existing Product Types")

Table 5: Existing Product Types
x
PC
Laptop
Accessories
Software
Display
Printer
PrinterSupplies
ExtendedWarranty
Netbook
Tablet
Smartphone
GameConsole

kable(unique(new_products$ProductType), caption = "Table 6: New Product Types")

Table 6: New Product Types
x
PC
Laptop
Netbook
Tablet
Smartphone
GameConsole
Display
Accessories
Software
Printer
PrinterSupplies
ExtendedWarranty

Section 5 - Data Transformation

In this section, the “Extended Guarantee” category is de-duplicated, and “not available” data is found. All 15 NAs are found in the “BestSellerRank” dataset column. This represent 20% of the total observations. For this reason the attribute was removed from the analysis.

New data categories are created in order to accomplish “full rank parameterization” that is required for accurate correlation and predictive analytics. Correlations between attributes are then found, and displayed via correlation plotting.

# De-duplicate "Extended Guarantee" category
existing_products2 <- existing_products[c(34:41),]
existing_products2 <- aggregate(.~ProductType, data=existing_products2, FUN=mean)
existing_products <- rbind(existing_products[-c(34:41),],existing_products2)
rm(existing_products2)

# Finds missing attrbiutes
sapply(existing_products, function(x) {sum(is.na(x))})

##           ProductType            ProductNum                 Price 
##                     0                     0                     0 
##         x5StarReviews         x4StarReviews         x3StarReviews 
##                     0                     0                     0 
##         x2StarReviews         x1StarReviews PositiveServiceReview 
##                     0                     0                     0 
## NegativeServiceReview      Recommendproduct       BestSellersRank 
##                     0                     0                    15 
##        ShippingWeight          ProductDepth          ProductWidth 
##                     0                     0                     0 
##         ProductHeight          ProfitMargin                Volume 
##                     0                     0                     0

# The BestSellerRank attribute is removed from the analysis
existing_products$BestSellersRank <- NULL

# Creates new data categories for "full rank parameterization" required for accurate correlation and predictive analytics.
newDataframe <- dummyVars(" ~ .", data = existing_products)
readyData <- data.frame(predict(newDataframe, newdata = existing_products))
      
# Finding Correlation between attributes
corrData <- cor(readyData)

# Displays a correlation plot
print(paste("Graph 1: Correlation Plot"))

## [1] "Graph 1: Correlation Plot"

corrplot(corrData)

Section 6 - Feature Detection

The features related to Service Reviews and Product Reviews were detected as the most correlated to predict Volume. To avoid overfitting the predictive analytics, the following thresholds were applied:
1) Independert variables with correlation > 0.95 with the dependent variable were removed.
2) Independent variables with strong correlation between themselves (> 0.85) were removed, keeping only the variables with strongest correlation with the volume.

# High correlation independent variables are selected.
keeps <- c("Volume", "x4StarReviews", "PositiveServiceReview")

# Low correlation independent variables are removed.
drops <- c("x5StarReviews", "x3StarReviews", "x2StarReviews", "x1StarReviews", "NegativeServiceReview")

# Bar Plot of Existing Products, Reviews x Volume
ggplot(existing_products, aes(x=x4StarReviews, y=Volume)) +
  geom_boxplot() +
  ggtitle("Graph 2: Bar Plot of Existing Products, Reviews x Volume")

# Plot of Existing Product Types, Reviews x Volume
ggplot(data = existing_products) +
  geom_smooth(mapping = aes(x=x4StarReviews, y=Volume)) +
  geom_point(mapping = aes(x=x4StarReviews, y=Volume,
                           color=ProductType)) +
  ggtitle("Graph 3: Plot of Existing Product Types, Reviews x Volume")

Section 7 - Outlier Removal

In this section of data pre-processing, the outliers with volume greater than 6000 are removed from the existing products dataset. Then, in order to specify the remaining outlier removal a list of product types for examination is created. 4 star reviews less than 100 and products with a volume of greater than 100 are removed, with the exception of the product type, “Smartphones”.

# remove outliers with volume > 6000
existing_products <- existing_products[-which(existing_products$Volume > 6000), ]

# create list with important product types
producttypes <- c("PC", "Laptop", "Netbook", "Smartphone")

# remove outliers, except smartphone
existing_products <- existing_products[-which(existing_products$x4StarReviews < 100 & existing_products$Volume > 1000 & as.character(existing_products$ProductType) != producttypes[4]),]

Section 8 - Modeling

Cross-validation was used to validate the prediction method, via the creation of training and test datasets for predictive analytics processing. The predictive analytic methods reviewed are:
1) Linear Modeling
2) Support Vector Machines
3) Random Forest

The accuracy of the predictive analytic methods is assessed with the statistics of “Root Mean Squared (RMSE)”, “RSquared”, and “Mean Absolute Error (MAE)”.

RMSE quantifies the error in the measuring unit of the variable. In such cases we need to compute indexes that average the residuals of the model. Because the residuals are both positive and negative and their distribution is nearly symmetrical, their average will always be zero. Therefore, RMSE averages the squared residuals, or the square root of the mean of the squared residuals.

R-Squared is possibly the most commonly used measure of accuracy, and explains the percentage of variance in the target variable. R-Squared is the ratio of the regression sum of squares and the total sum of squares.
MAE is used to mitigate the inflation of the value of RMSE by large residuals. MAE is usually presented with RMSE as an alternate statistics to gain a greater understanding of the accuracy of the prediction model.

# Creates train and test sets for Cross Validation
CrossValidation <- trainControl(method = "repeatedcv", number = 4, repeats = 100)

# Linear Model Predictive Analytics
set.seed(123)
LMFit <- train(Volume~., data = existing_products[keeps], method = "lm", trControl = CrossValidation, preProc = c("center","scale"))

# Table 7: Review the accuracy of the Linear Model fit
LMFit

## Linear Regression 
## 
## 65 samples
##  2 predictor
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (4 fold, repeated 100 times) 
## Summary of sample sizes: 48, 49, 49, 49, 48, 49, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   244.1621  0.7721457  118.5894
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

# Support Vector Machines Predictive Analytics
set.seed(123)
SVMFit <- train(Volume~., data = existing_products[keeps], method = "svmLinear2", trControl = CrossValidation, preProc = c("center","scale"))

# Table 8: Review the accuracy of the Support Vector Machines fit
SVMFit

## Support Vector Machines with Linear Kernel 
## 
## 65 samples
##  2 predictor
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (4 fold, repeated 100 times) 
## Summary of sample sizes: 48, 49, 49, 49, 48, 49, ... 
## Resampling results across tuning parameters:
## 
##   cost  RMSE      Rsquared   MAE     
##   0.25  231.0488  0.7741247  107.1101
##   0.50  221.8023  0.7846922  103.7282
##   1.00  216.1003  0.7910762  101.7691
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cost = 1.

# Random Forest Predictive Analytics --- Selected --- 
set.seed(123)
RFFit <- train(Volume~., data = existing_products[keeps], method = "rf", trControl = CrossValidation, preProc = c("center","scale"))

## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

# Table 9: Review the accuracy of the Random Forest fit
RFFit

## Random Forest 
## 
## 65 samples
##  2 predictor
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (4 fold, repeated 100 times) 
## Summary of sample sizes: 48, 49, 49, 49, 48, 49, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   199.8766  0.8449612  99.5098
## 
## Tuning parameter 'mtry' was held constant at a value of 2

Section 9 - Predicting Sales on New Products

This section creates the sales predictions for four target product types, (PC, Laptops, Netbooks and Smartphones), in the new product attributes data set, formats a table of results, and saves the prediction results in .csv format.

# Selects product types within the the scope of this analysis (PC, Laptops, Netbooks and Smartphones).
new_products <- new_products[which(new_products$ProductType %in% producttypes),] 

# Predicts sales volume
Predictions <- predict(RFFit, newdata=new_products[keeps])

# Places the predictions in the "New Products" volume field
new_products$Volume <- round(Predictions)

# Calculate Revenue  
new_products$Revenue  <- round((new_products$Price * new_products$ProfitMargin * new_products$Volume), digits = 2)

# Format a table of sales predictions
SalesPredictionTable <- data.frame(new_products[,1], new_products$Volume, new_products$Revenue)
names(SalesPredictionTable) <- c("New Products", "Volume Predictions", "Revenue Predictions" )

# Print the table of sales predictions
kable(SalesPredictionTable, caption = "Table 10: Sales Forecasts for New Products")

Table 10: Sales Forecasts for New Products
New Products	Volume Predictions	Revenue Predictions
PC	401	70074.75
PC	130	22360.00
Laptop	201	24099.90
Laptop	52	9352.20
Laptop	6	2758.62
Netbook	51	1631.96
Netbook	1278	37841.58
Netbook	115	5553.35
Netbook	30	891.00
Smartphone	295	6457.55
Smartphone	651	3827.88
Smartphone	82	1832.70
Smartphone	140	4620.00

# Save Predictions in CSV
write.csv(new_products, "new_products_Prediction.csv")

Section 10 - Impact of Customer and Service Reviews on Sales Volume

This section contains visualizations that allow for the assessment of the impact that Services Reviews, and Customer Reviews, have on sales of the entire range of product types.

The first plot is a box plot of the effect of reviews on sales volume. The second plot displays the effect of reviews on sales volume by product type.

Section 11 - Conclusions

The algorithms used for this document include functions for Exploratory Data Analysis that summarize the statistics within the datasets, functions to improve the later predictive analytics via filling in dataset information, functions to find data category correlations, functions for data visualization, and functions to subset the data for outlier removal.

The method of Predictive Analytics used for this document involves Cross-Validation for Machine Learning of trends in the dataset. After the dataset is divided into training and testing datasets, the predicive algorithms used for this document are linear modeling, support vector machines, and random forest, (multiple decision trees).

The Random Forest algorithm was selected with the rationale that the 80% confidence level of the Random Forest algorithm would provide more accurate predictions than the 78.8% confidence level of the Support Vector Machines algorithm, or the 77.2% confidence level of the Linear Modeling algorithm.

Sales predictions were then created for the four target product types found in the new product data set. The results were formatted into a dataframe of predictions.

Lastly, the impact of customer and service reviews on sales volume are displayed visually with the ggplot2 graphics package.

Section 12 - Appendix

Section 12.1 - Required Packages

Table 11: List of Required Packages
	List of Required Packages
Required Packages	‘readr’ ‘lattice’ ‘caret’ ‘datasets’ ‘reshape2’ ‘base’ ‘stats’ ‘e1071’ ‘magrittr’ ‘corrplot’ ‘randomForest’ ‘gbm’ ‘ggplot2’ ‘knitr’

Section 12.2 - Session Information

Table 12: Session Information
	Session Information
R Version	R version 3.6.2 (2019-12-12)
Platform	x86_64-w64-mingw32/x64 (64-bit)
Running	Windows 10 x64 (build 17763)
RStudio Citation	RStudio: Integrated Development Environment for R
RStudio Version	1.0.153

Sales Prediction of New Products

ContextBase, https://contextbase.github.io

All programming by John Akwei, ECMp ERMp Data Scientist

January 28, 2020

Table of Contents

Section 1 - The Problem

Section 2 - The Solution

Section 3 - Data Import

Section 4 - Exploratory Data Analysis

Section 5 - Data Transformation

Section 6 - Feature Detection

Section 7 - Outlier Removal

Section 8 - Modeling

Section 9 - Predicting Sales on New Products

Section 10 - Impact of Customer and Service Reviews on Sales Volume

Section 11 - Conclusions

Section 12 - Appendix

Section 12.1 - Required Packages

Section 12.2 - Session Information