Section 1 - The Problem
Section 2 - The Solution
Section 3 - Data Import
Section 4 - Exploratory Data Analysis
Section 5 - Data Transformation
Section 6 - Feature Detection
Section 7 - Outlier Removal
Section 8 - Modeling
Section 9 - Predicting Sales on New Products
Section 10 - Impact of Customer and Service Reviews on Sales Volume
Section 11 - Conclusions
Section 12 - Appendix
Section 12.1 - Required Packages
Section 12.2 - Session Information
Sales Teams need to better understand how new types of Products might impact Sales across the Enterprise, and how specific new Product Types perform against each other. Therefore, prediction of sales volume for new Product Types is required, even though the only data available for analysis is historical sales data.
The solution this document proposes is to use data of historical sales and analyze the impact of services reviews and customer reviews on historical products to predict sales volume of new products. Exploratory Data Analysis, Data Transformation, Feature Detection, Outlier Removal, and Predictive Modeling are applied, and the results presented in tabular and graphic forms. Four different product types: PC, Laptops, Netbooks and Smartphones are assessed.
Existing products and new products datasets imported are in the .csv format. These datasets contain product type, number, price, reviews, recommendation, rank, weight, dimension, profit margin, and volume data for a range of electronic products sold by the client.
existing_products <- read.csv("Assignment 1/Assignment 1 Data/existingproductattributes2017.csv")
new_products <- read.csv("Assignment 1/Assignment 1 Data/newproductattributes2017.csv")
The Data Science analysis of the imported datasets begins with exploration of the categories of observations, and summary statitics of the datasets.
kable(existing_products[1:5,], caption = "Table 1: Sample of Existing Products Records")
| ProductType | ProductNum | Price | x5StarReviews | x4StarReviews | x3StarReviews | x2StarReviews | x1StarReviews | PositiveServiceReview | NegativeServiceReview | Recommendproduct | BestSellersRank | ShippingWeight | ProductDepth | ProductWidth | ProductHeight | ProfitMargin | Volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PC | 101 | 949.00 | 3 | 3 | 2 | 0 | 0 | 2 | 0 | 0.9 | 1967 | 25.8 | 23.94 | 6.62 | 16.89 | 0.15 | 12 |
| PC | 102 | 2249.99 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0.9 | 4806 | 50.0 | 35.00 | 31.75 | 19.00 | 0.25 | 8 |
| PC | 103 | 399.00 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0.9 | 12076 | 17.4 | 10.50 | 8.30 | 10.20 | 0.08 | 12 |
| Laptop | 104 | 409.99 | 49 | 19 | 8 | 3 | 9 | 7 | 8 | 0.8 | 109 | 5.7 | 15.00 | 9.90 | 1.30 | 0.08 | 196 |
| Laptop | 105 | 1079.99 | 58 | 31 | 11 | 7 | 36 | 7 | 20 | 0.7 | 268 | 7.0 | 12.90 | 0.30 | 8.90 | 0.09 | 232 |
kable(new_products[1:5,], caption = "Table 2: Sample of New Products Records")
| ProductType | ProductNum | Price | x5StarReviews | x4StarReviews | x3StarReviews | x2StarReviews | x1StarReviews | PositiveServiceReview | NegativeServiceReview | Recommendproduct | BestSellersRank | ShippingWeight | ProductDepth | ProductWidth | ProductHeight | ProfitMargin | Volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PC | 171 | 699 | 96 | 26 | 14 | 14 | 25 | 12 | 3 | 0.7 | 2498 | 19.9 | 20.63 | 19.25 | 8.39 | 0.25 | 0 |
| PC | 172 | 860 | 51 | 11 | 10 | 10 | 21 | 7 | 5 | 0.6 | 490 | 27.0 | 21.89 | 27.01 | 9.13 | 0.20 | 0 |
| Laptop | 173 | 1199 | 74 | 10 | 3 | 3 | 11 | 11 | 5 | 0.8 | 111 | 6.6 | 8.94 | 12.80 | 0.68 | 0.10 | 0 |
| Laptop | 175 | 1199 | 7 | 2 | 1 | 1 | 1 | 2 | 1 | 0.6 | 4446 | 13.0 | 16.30 | 10.80 | 1.40 | 0.15 | 0 |
| Laptop | 176 | 1999 | 1 | 1 | 1 | 3 | 0 | 0 | 1 | 0.3 | 2820 | 11.6 | 16.81 | 10.90 | 0.88 | 0.23 | 0 |
# Summary
kable(summary(existing_products), caption = "Table 3: Summary Statistics of Existing Products")
| ProductType | ProductNum | Price | x5StarReviews | x4StarReviews | x3StarReviews | x2StarReviews | x1StarReviews | PositiveServiceReview | NegativeServiceReview | Recommendproduct | BestSellersRank | ShippingWeight | ProductDepth | ProductWidth | ProductHeight | ProfitMargin | Volume | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accessories :26 | Min. :101.0 | Min. : 3.60 | Min. : 0.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.000 | Min. :0.100 | Min. : 1 | Min. : 0.0100 | Min. : 0.000 | Min. : 0.000 | Min. : 0.000 | Min. :0.0500 | Min. : 0 | |
| Printer :12 | 1st Qu.:120.8 | 1st Qu.: 52.66 | 1st Qu.: 10.0 | 1st Qu.: 2.75 | 1st Qu.: 2.00 | 1st Qu.: 1.00 | 1st Qu.: 2.00 | 1st Qu.: 2.00 | 1st Qu.: 1.000 | 1st Qu.:0.700 | 1st Qu.: 7 | 1st Qu.: 0.5125 | 1st Qu.: 4.775 | 1st Qu.: 1.750 | 1st Qu.: 0.400 | 1st Qu.:0.0500 | 1st Qu.: 40 | |
| ExtendedWarranty:10 | Median :140.5 | Median : 132.72 | Median : 50.0 | Median : 22.00 | Median : 7.00 | Median : 3.00 | Median : 8.50 | Median : 5.50 | Median : 3.000 | Median :0.800 | Median : 27 | Median : 2.1000 | Median : 7.950 | Median : 6.800 | Median : 3.950 | Median :0.1200 | Median : 200 | |
| Software : 6 | Mean :142.6 | Mean : 247.25 | Mean : 176.2 | Mean : 40.20 | Mean : 14.79 | Mean : 13.79 | Mean : 37.67 | Mean : 51.75 | Mean : 6.225 | Mean :0.745 | Mean : 1126 | Mean : 9.6681 | Mean : 14.425 | Mean : 7.819 | Mean : 6.259 | Mean :0.1545 | Mean : 705 | |
| Display : 5 | 3rd Qu.:160.2 | 3rd Qu.: 352.49 | 3rd Qu.: 306.5 | 3rd Qu.: 33.00 | 3rd Qu.: 11.25 | 3rd Qu.: 7.00 | 3rd Qu.: 15.25 | 3rd Qu.: 42.00 | 3rd Qu.: 6.250 | 3rd Qu.:0.900 | 3rd Qu.: 281 | 3rd Qu.:11.2050 | 3rd Qu.: 15.025 | 3rd Qu.:11.275 | 3rd Qu.:10.300 | 3rd Qu.:0.2000 | 3rd Qu.: 1226 | |
| PC : 4 | Max. :200.0 | Max. :2249.99 | Max. :2801.0 | Max. :431.00 | Max. :162.00 | Max. :370.00 | Max. :1654.00 | Max. :536.00 | Max. :112.000 | Max. :1.000 | Max. :17502 | Max. :63.0000 | Max. :300.000 | Max. :31.750 | Max. :25.800 | Max. :0.4000 | Max. :11204 | |
| (Other) :17 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :15 | NA | NA | NA | NA | NA | NA |
kable(summary(new_products), caption = "Table 4: Summary Statistics of New Products")
| ProductType | ProductNum | Price | x5StarReviews | x4StarReviews | x3StarReviews | x2StarReviews | x1StarReviews | PositiveServiceReview | NegativeServiceReview | Recommendproduct | BestSellersRank | ShippingWeight | ProductDepth | ProductWidth | ProductHeight | ProfitMargin | Volume | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Netbook :4 | Min. :171.0 | Min. : 8.5 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.000 | Min. :0.3000 | Min. : 1.00 | Min. : 0.200 | Min. : 0.000 | Min. : 0.000 | Min. : 0.000 | Min. :0.0500 | Min. :0 | |
| Smartphone :4 | 1st Qu.:179.5 | 1st Qu.: 130.0 | 1st Qu.: 16.00 | 1st Qu.: 2.00 | 1st Qu.: 1.75 | 1st Qu.: 1.00 | 1st Qu.: 1.75 | 1st Qu.: 2.00 | 1st Qu.: 1.000 | 1st Qu.:0.6000 | 1st Qu.: 93.25 | 1st Qu.: 0.900 | 1st Qu.: 5.225 | 1st Qu.: 5.832 | 1st Qu.: 0.400 | 1st Qu.:0.0975 | 1st Qu.:0 | |
| Laptop :3 | Median :193.5 | Median : 275.0 | Median : 46.00 | Median : 10.50 | Median : 4.50 | Median : 4.00 | Median : 13.00 | Median : 5.00 | Median : 3.500 | Median :0.7000 | Median : 750.50 | Median : 4.450 | Median : 8.000 | Median : 9.950 | Median : 0.985 | Median :0.1150 | Median :0 | |
| Accessories:2 | Mean :219.5 | Mean : 425.6 | Mean : 178.50 | Mean : 48.04 | Mean : 21.92 | Mean : 17.50 | Mean : 27.58 | Mean :13.46 | Mean : 5.667 | Mean :0.6708 | Mean : 3957.62 | Mean : 7.802 | Mean : 9.094 | Mean :10.408 | Mean : 3.541 | Mean :0.1817 | Mean :0 | |
| GameConsole:2 | 3rd Qu.:301.2 | 3rd Qu.: 486.5 | 3rd Qu.: 99.25 | 3rd Qu.: 26.00 | 3rd Qu.: 16.75 | 3rd Qu.: 20.25 | 3rd Qu.: 35.25 | 3rd Qu.:12.50 | 3rd Qu.: 7.500 | 3rd Qu.:0.8000 | 3rd Qu.: 3150.00 | 3rd Qu.: 9.575 | 3rd Qu.:11.425 | 3rd Qu.:12.875 | 3rd Qu.: 2.888 | 3rd Qu.:0.2000 | 3rd Qu.:0 | |
| PC :2 | Max. :307.0 | Max. :1999.0 | Max. :1525.00 | Max. :437.00 | Max. :224.00 | Max. :160.00 | Max. :247.00 | Max. :90.00 | Max. :23.000 | Max. :1.0000 | Max. :44465.00 | Max. :42.000 | Max. :21.890 | Max. :27.010 | Max. :25.800 | Max. :0.9000 | Max. :0 | |
| (Other) :7 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
# checking product types
kable(unique(existing_products$ProductType), caption = "Table 5: Existing Product Types")
| x |
|---|
| PC |
| Laptop |
| Accessories |
| Software |
| Display |
| Printer |
| PrinterSupplies |
| ExtendedWarranty |
| Netbook |
| Tablet |
| Smartphone |
| GameConsole |
kable(unique(new_products$ProductType), caption = "Table 6: New Product Types")
| x |
|---|
| PC |
| Laptop |
| Netbook |
| Tablet |
| Smartphone |
| GameConsole |
| Display |
| Accessories |
| Software |
| Printer |
| PrinterSupplies |
| ExtendedWarranty |
In this section, the “Extended Guarantee” category is de-duplicated, and “not available” data is found. All 15 NAs are found in the “BestSellerRank” dataset column. This represent 20% of the total observations. For this reason the attribute was removed from the analysis.
New data categories are created in order to accomplish “full rank parameterization” that is required for accurate correlation and predictive analytics. Correlations between attributes are then found, and displayed via correlation plotting.
# De-duplicate "Extended Guarantee" category
existing_products2 <- existing_products[c(34:41),]
existing_products2 <- aggregate(.~ProductType, data=existing_products2, FUN=mean)
existing_products <- rbind(existing_products[-c(34:41),],existing_products2)
rm(existing_products2)
# Finds missing attrbiutes
sapply(existing_products, function(x) {sum(is.na(x))})
## ProductType ProductNum Price
## 0 0 0
## x5StarReviews x4StarReviews x3StarReviews
## 0 0 0
## x2StarReviews x1StarReviews PositiveServiceReview
## 0 0 0
## NegativeServiceReview Recommendproduct BestSellersRank
## 0 0 15
## ShippingWeight ProductDepth ProductWidth
## 0 0 0
## ProductHeight ProfitMargin Volume
## 0 0 0
# The BestSellerRank attribute is removed from the analysis
existing_products$BestSellersRank <- NULL
# Creates new data categories for "full rank parameterization" required for accurate correlation and predictive analytics.
newDataframe <- dummyVars(" ~ .", data = existing_products)
readyData <- data.frame(predict(newDataframe, newdata = existing_products))
# Finding Correlation between attributes
corrData <- cor(readyData)
# Displays a correlation plot
print(paste("Graph 1: Correlation Plot"))
## [1] "Graph 1: Correlation Plot"
corrplot(corrData)
The features related to Service Reviews and Product Reviews were detected as the most correlated to predict Volume. To avoid overfitting the predictive analytics, the following thresholds were applied:
1) Independert variables with correlation > 0.95 with the dependent variable were removed.
2) Independent variables with strong correlation between themselves (> 0.85) were removed, keeping only the variables with strongest correlation with the volume.
# High correlation independent variables are selected.
keeps <- c("Volume", "x4StarReviews", "PositiveServiceReview")
# Low correlation independent variables are removed.
drops <- c("x5StarReviews", "x3StarReviews", "x2StarReviews", "x1StarReviews", "NegativeServiceReview")
# Bar Plot of Existing Products, Reviews x Volume
ggplot(existing_products, aes(x=x4StarReviews, y=Volume)) +
geom_boxplot() +
ggtitle("Graph 2: Bar Plot of Existing Products, Reviews x Volume")
# Plot of Existing Product Types, Reviews x Volume
ggplot(data = existing_products) +
geom_smooth(mapping = aes(x=x4StarReviews, y=Volume)) +
geom_point(mapping = aes(x=x4StarReviews, y=Volume,
color=ProductType)) +
ggtitle("Graph 3: Plot of Existing Product Types, Reviews x Volume")
In this section of data pre-processing, the outliers with volume greater than 6000 are removed from the existing products dataset. Then, in order to specify the remaining outlier removal a list of product types for examination is created. 4 star reviews less than 100 and products with a volume of greater than 100 are removed, with the exception of the product type, “Smartphones”.
# remove outliers with volume > 6000
existing_products <- existing_products[-which(existing_products$Volume > 6000), ]
# create list with important product types
producttypes <- c("PC", "Laptop", "Netbook", "Smartphone")
# remove outliers, except smartphone
existing_products <- existing_products[-which(existing_products$x4StarReviews < 100 & existing_products$Volume > 1000 & as.character(existing_products$ProductType) != producttypes[4]),]
Cross-validation was used to validate the prediction method, via the creation of training and test datasets for predictive analytics processing. The predictive analytic methods reviewed are:
1) Linear Modeling
2) Support Vector Machines
3) Random Forest
The accuracy of the predictive analytic methods is assessed with the statistics of “Root Mean Squared (RMSE)”, “RSquared”, and “Mean Absolute Error (MAE)”.
RMSE quantifies the error in the measuring unit of the variable. In such cases we need to compute indexes that average the residuals of the model. Because the residuals are both positive and negative and their distribution is nearly symmetrical, their average will always be zero. Therefore, RMSE averages the squared residuals, or the square root of the mean of the squared residuals.
R-Squared is possibly the most commonly used measure of accuracy, and explains the percentage of variance in the target variable. R-Squared is the ratio of the regression sum of squares and the total sum of squares.
MAE is used to mitigate the inflation of the value of RMSE by large residuals. MAE is usually presented with RMSE as an alternate statistics to gain a greater understanding of the accuracy of the prediction model.
# Creates train and test sets for Cross Validation
CrossValidation <- trainControl(method = "repeatedcv", number = 4, repeats = 100)
# Linear Model Predictive Analytics
set.seed(123)
LMFit <- train(Volume~., data = existing_products[keeps], method = "lm", trControl = CrossValidation, preProc = c("center","scale"))
# Table 7: Review the accuracy of the Linear Model fit
LMFit
## Linear Regression
##
## 65 samples
## 2 predictor
##
## Pre-processing: centered (2), scaled (2)
## Resampling: Cross-Validated (4 fold, repeated 100 times)
## Summary of sample sizes: 48, 49, 49, 49, 48, 49, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 244.1621 0.7721457 118.5894
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Support Vector Machines Predictive Analytics
set.seed(123)
SVMFit <- train(Volume~., data = existing_products[keeps], method = "svmLinear2", trControl = CrossValidation, preProc = c("center","scale"))
# Table 8: Review the accuracy of the Support Vector Machines fit
SVMFit
## Support Vector Machines with Linear Kernel
##
## 65 samples
## 2 predictor
##
## Pre-processing: centered (2), scaled (2)
## Resampling: Cross-Validated (4 fold, repeated 100 times)
## Summary of sample sizes: 48, 49, 49, 49, 48, 49, ...
## Resampling results across tuning parameters:
##
## cost RMSE Rsquared MAE
## 0.25 231.0488 0.7741247 107.1101
## 0.50 221.8023 0.7846922 103.7282
## 1.00 216.1003 0.7910762 101.7691
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cost = 1.
# Random Forest Predictive Analytics --- Selected ---
set.seed(123)
RFFit <- train(Volume~., data = existing_products[keeps], method = "rf", trControl = CrossValidation, preProc = c("center","scale"))
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
# Table 9: Review the accuracy of the Random Forest fit
RFFit
## Random Forest
##
## 65 samples
## 2 predictor
##
## Pre-processing: centered (2), scaled (2)
## Resampling: Cross-Validated (4 fold, repeated 100 times)
## Summary of sample sizes: 48, 49, 49, 49, 48, 49, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 199.8766 0.8449612 99.5098
##
## Tuning parameter 'mtry' was held constant at a value of 2
This section creates the sales predictions for four target product types, (PC, Laptops, Netbooks and Smartphones), in the new product attributes data set, formats a table of results, and saves the prediction results in .csv format.
# Selects product types within the the scope of this analysis (PC, Laptops, Netbooks and Smartphones).
new_products <- new_products[which(new_products$ProductType %in% producttypes),]
# Predicts sales volume
Predictions <- predict(RFFit, newdata=new_products[keeps])
# Places the predictions in the "New Products" volume field
new_products$Volume <- round(Predictions)
# Calculate Revenue
new_products$Revenue <- round((new_products$Price * new_products$ProfitMargin * new_products$Volume), digits = 2)
# Format a table of sales predictions
SalesPredictionTable <- data.frame(new_products[,1], new_products$Volume, new_products$Revenue)
names(SalesPredictionTable) <- c("New Products", "Volume Predictions", "Revenue Predictions" )
# Print the table of sales predictions
kable(SalesPredictionTable, caption = "Table 10: Sales Forecasts for New Products")
| New Products | Volume Predictions | Revenue Predictions |
|---|---|---|
| PC | 401 | 70074.75 |
| PC | 130 | 22360.00 |
| Laptop | 201 | 24099.90 |
| Laptop | 52 | 9352.20 |
| Laptop | 6 | 2758.62 |
| Netbook | 51 | 1631.96 |
| Netbook | 1278 | 37841.58 |
| Netbook | 115 | 5553.35 |
| Netbook | 30 | 891.00 |
| Smartphone | 295 | 6457.55 |
| Smartphone | 651 | 3827.88 |
| Smartphone | 82 | 1832.70 |
| Smartphone | 140 | 4620.00 |
# Save Predictions in CSV
write.csv(new_products, "new_products_Prediction.csv")
This section contains visualizations that allow for the assessment of the impact that Services Reviews, and Customer Reviews, have on sales of the entire range of product types.
The first plot is a box plot of the effect of reviews on sales volume. The second plot displays the effect of reviews on sales volume by product type.
The algorithms used for this document include functions for Exploratory Data Analysis that summarize the statistics within the datasets, functions to improve the later predictive analytics via filling in dataset information, functions to find data category correlations, functions for data visualization, and functions to subset the data for outlier removal.
The method of Predictive Analytics used for this document involves Cross-Validation for Machine Learning of trends in the dataset. After the dataset is divided into training and testing datasets, the predicive algorithms used for this document are linear modeling, support vector machines, and random forest, (multiple decision trees).
The Random Forest algorithm was selected with the rationale that the 80% confidence level of the Random Forest algorithm would provide more accurate predictions than the 78.8% confidence level of the Support Vector Machines algorithm, or the 77.2% confidence level of the Linear Modeling algorithm.
Sales predictions were then created for the four target product types found in the new product data set. The results were formatted into a dataframe of predictions.
Lastly, the impact of customer and service reviews on sales volume are displayed visually with the ggplot2 graphics package.
| List of Required Packages | |
|---|---|
| Required Packages | ‘readr’ ‘lattice’ ‘caret’ ‘datasets’ ‘reshape2’ ‘base’ ‘stats’ ‘e1071’ ‘magrittr’ ‘corrplot’ ‘randomForest’ ‘gbm’ ‘ggplot2’ ‘knitr’ |
| Session Information | |
|---|---|
| R Version | R version 3.6.2 (2019-12-12) |
| Platform | x86_64-w64-mingw32/x64 (64-bit) |
| Running | Windows 10 x64 (build 17763) |
| RStudio Citation | RStudio: Integrated Development Environment for R |
| RStudio Version | 1.0.153 |