Data Visualization and pre-processing

df <- read.csv(("C:/Users/andre/OneDrive/Andres Marquez/UBIQUM/Project 3/Part 2/existingproducts.csv"))
NewProducts <- read.csv(("C:/Users/andre/OneDrive/Andres Marquez/UBIQUM/Project 3/Part 2/newproducts.csv"))

All around summary

skim(df)
Data summary
Name df
Number of rows 80
Number of columns 18
_______________________
Column type frequency:
character 1
numeric 17
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ProductType 0 1 2 16 0 12 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ProductNum 0 1.00 142.55 26.73 101.00 120.75 140.50 160.25 200.00 ▇▇▇▃▃
Price 0 1.00 247.25 339.57 3.60 52.66 132.72 352.49 2249.99 ▇▁▁▁▁
x5StarReviews 0 1.00 176.25 379.19 0.00 10.00 50.00 306.50 2801.00 ▇▁▁▁▁
x4StarReviews 0 1.00 40.20 70.74 0.00 2.75 22.00 33.00 431.00 ▇▁▁▁▁
x3StarReviews 0 1.00 14.79 30.96 0.00 2.00 7.00 11.25 162.00 ▇▁▁▁▁
x2StarReviews 0 1.00 13.79 45.92 0.00 1.00 3.00 7.00 370.00 ▇▁▁▁▁
x1StarReviews 0 1.00 37.67 186.72 0.00 2.00 8.50 15.25 1654.00 ▇▁▁▁▁
PositiveServiceReview 0 1.00 51.75 104.47 0.00 2.00 5.50 42.00 536.00 ▇▁▁▁▁
NegativeServiceReview 0 1.00 6.22 14.30 0.00 1.00 3.00 6.25 112.00 ▇▁▁▁▁
Recommendproduct 0 1.00 0.74 0.20 0.10 0.70 0.80 0.90 1.00 ▁▂▂▇▇
BestSellersRank 15 0.81 1126.31 3261.13 1.00 7.00 27.00 281.00 17502.00 ▇▁▁▁▁
ShippingWeight 0 1.00 9.67 15.21 0.01 0.51 2.10 11.20 63.00 ▇▁▁▁▁
ProductDepth 0 1.00 14.43 34.93 0.00 4.77 7.95 15.03 300.00 ▇▁▁▁▁
ProductWidth 0 1.00 7.82 6.81 0.00 1.75 6.80 11.27 31.75 ▇▆▃▁▁
ProductHeight 0 1.00 6.26 6.89 0.00 0.40 3.95 10.30 25.80 ▇▃▂▁▁
ProfitMargin 0 1.00 0.15 0.12 0.05 0.05 0.12 0.20 0.40 ▇▃▁▁▂
Volume 0 1.00 705.00 1516.77 0.00 40.00 200.00 1226.00 11204.00 ▇▁▁▁▁
sum(duplicated(df))
## [1] 0

Handling Missing Values using MICE pakcage

PMM (Predictive Mean Matching)

df_c <- subset(df, select = -c(ProductType))
summary(df_c)
##    ProductNum        Price         x5StarReviews    x4StarReviews   
##  Min.   :101.0   Min.   :   3.60   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:120.8   1st Qu.:  52.66   1st Qu.:  10.0   1st Qu.:  2.75  
##  Median :140.5   Median : 132.72   Median :  50.0   Median : 22.00  
##  Mean   :142.6   Mean   : 247.25   Mean   : 176.2   Mean   : 40.20  
##  3rd Qu.:160.2   3rd Qu.: 352.49   3rd Qu.: 306.5   3rd Qu.: 33.00  
##  Max.   :200.0   Max.   :2249.99   Max.   :2801.0   Max.   :431.00  
##                                                                     
##  x3StarReviews    x2StarReviews    x1StarReviews     PositiveServiceReview
##  Min.   :  0.00   Min.   :  0.00   Min.   :   0.00   Min.   :  0.00       
##  1st Qu.:  2.00   1st Qu.:  1.00   1st Qu.:   2.00   1st Qu.:  2.00       
##  Median :  7.00   Median :  3.00   Median :   8.50   Median :  5.50       
##  Mean   : 14.79   Mean   : 13.79   Mean   :  37.67   Mean   : 51.75       
##  3rd Qu.: 11.25   3rd Qu.:  7.00   3rd Qu.:  15.25   3rd Qu.: 42.00       
##  Max.   :162.00   Max.   :370.00   Max.   :1654.00   Max.   :536.00       
##                                                                           
##  NegativeServiceReview Recommendproduct BestSellersRank ShippingWeight   
##  Min.   :  0.000       Min.   :0.100    Min.   :    1   Min.   : 0.0100  
##  1st Qu.:  1.000       1st Qu.:0.700    1st Qu.:    7   1st Qu.: 0.5125  
##  Median :  3.000       Median :0.800    Median :   27   Median : 2.1000  
##  Mean   :  6.225       Mean   :0.745    Mean   : 1126   Mean   : 9.6681  
##  3rd Qu.:  6.250       3rd Qu.:0.900    3rd Qu.:  281   3rd Qu.:11.2050  
##  Max.   :112.000       Max.   :1.000    Max.   :17502   Max.   :63.0000  
##                                         NA's   :15                       
##   ProductDepth      ProductWidth    ProductHeight     ProfitMargin   
##  Min.   :  0.000   Min.   : 0.000   Min.   : 0.000   Min.   :0.0500  
##  1st Qu.:  4.775   1st Qu.: 1.750   1st Qu.: 0.400   1st Qu.:0.0500  
##  Median :  7.950   Median : 6.800   Median : 3.950   Median :0.1200  
##  Mean   : 14.425   Mean   : 7.819   Mean   : 6.259   Mean   :0.1545  
##  3rd Qu.: 15.025   3rd Qu.:11.275   3rd Qu.:10.300   3rd Qu.:0.2000  
##  Max.   :300.000   Max.   :31.750   Max.   :25.800   Max.   :0.4000  
##                                                                      
##      Volume     
##  Min.   :    0  
##  1st Qu.:   40  
##  Median :  200  
##  Mean   :  705  
##  3rd Qu.: 1226  
##  Max.   :11204  
## 
summary(imputed_data)
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##            ProductNum                 Price         x5StarReviews 
##                    ""                    ""                    "" 
##         x4StarReviews         x3StarReviews         x2StarReviews 
##                    ""                    ""                    "" 
##         x1StarReviews PositiveServiceReview NegativeServiceReview 
##                    ""                    ""                    "" 
##      Recommendproduct       BestSellersRank        ShippingWeight 
##                    ""                 "pmm"                    "" 
##          ProductDepth          ProductWidth         ProductHeight 
##                    ""                    ""                    "" 
##          ProfitMargin                Volume 
##                    ""                    "" 
## PredictorMatrix:
##               ProductNum Price x5StarReviews x4StarReviews x3StarReviews
## ProductNum             0     1             1             1             1
## Price                  1     0             1             1             1
## x5StarReviews          1     1             0             1             1
## x4StarReviews          1     1             1             0             1
## x3StarReviews          1     1             1             1             0
## x2StarReviews          1     1             1             1             1
##               x2StarReviews x1StarReviews PositiveServiceReview
## ProductNum                1             1                     1
## Price                     1             1                     1
## x5StarReviews             1             1                     1
## x4StarReviews             1             1                     1
## x3StarReviews             1             1                     1
## x2StarReviews             0             1                     1
##               NegativeServiceReview Recommendproduct BestSellersRank
## ProductNum                        1                1               1
## Price                             1                1               1
## x5StarReviews                     1                1               1
## x4StarReviews                     1                1               1
## x3StarReviews                     1                1               1
## x2StarReviews                     1                1               1
##               ShippingWeight ProductDepth ProductWidth ProductHeight
## ProductNum                 1            1            1             1
## Price                      1            1            1             1
## x5StarReviews              1            1            1             1
## x4StarReviews              1            1            1             1
## x3StarReviews              1            1            1             1
## x2StarReviews              1            1            1             1
##               ProfitMargin Volume
## ProductNum               1      0
## Price                    1      0
## x5StarReviews            1      0
## x4StarReviews            1      0
## x3StarReviews            1      0
## x2StarReviews            1      0
## Number of logged events:  1 
##   it im dep      meth    out
## 1  0  0     collinear Volume

Let’s now check the five columns with the different imputed values

imputed_data$imp$BestSellersRank
##        1     2   3     4   5
## 7     76    48  22 12076   7
## 9     17    17   7   720   7
## 11    60   110 927  4806 100
## 13     2     7  27    16  29
## 15     3  5742  17    11   7
## 19 14086 14086   7   720  18
## 21    11   134  11    17   7
## 30   544     7   2     1  22
## 31   109   352   2     3  22
## 42   100    29   4   110 720
## 63    29     7  11    27   1
## 64    27   661  69   281 927
## 65    27     6   7     7   7
## 66     3   100  10     2   7
## 69     2    27   3    16 352

Let’s now add the values to our data frame, in this case we’ll be using the first column

 df_c <- complete(imputed_data,1)
summary(df_c)
##    ProductNum        Price         x5StarReviews    x4StarReviews   
##  Min.   :101.0   Min.   :   3.60   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:120.8   1st Qu.:  52.66   1st Qu.:  10.0   1st Qu.:  2.75  
##  Median :140.5   Median : 132.72   Median :  50.0   Median : 22.00  
##  Mean   :142.6   Mean   : 247.25   Mean   : 176.2   Mean   : 40.20  
##  3rd Qu.:160.2   3rd Qu.: 352.49   3rd Qu.: 306.5   3rd Qu.: 33.00  
##  Max.   :200.0   Max.   :2249.99   Max.   :2801.0   Max.   :431.00  
##  x3StarReviews    x2StarReviews    x1StarReviews     PositiveServiceReview
##  Min.   :  0.00   Min.   :  0.00   Min.   :   0.00   Min.   :  0.00       
##  1st Qu.:  2.00   1st Qu.:  1.00   1st Qu.:   2.00   1st Qu.:  2.00       
##  Median :  7.00   Median :  3.00   Median :   8.50   Median :  5.50       
##  Mean   : 14.79   Mean   : 13.79   Mean   :  37.67   Mean   : 51.75       
##  3rd Qu.: 11.25   3rd Qu.:  7.00   3rd Qu.:  15.25   3rd Qu.: 42.00       
##  Max.   :162.00   Max.   :370.00   Max.   :1654.00   Max.   :536.00       
##  NegativeServiceReview Recommendproduct BestSellersRank   ShippingWeight   
##  Min.   :  0.000       Min.   :0.100    Min.   :    1.0   Min.   : 0.0100  
##  1st Qu.:  1.000       1st Qu.:0.700    1st Qu.:    7.0   1st Qu.: 0.5125  
##  Median :  3.000       Median :0.800    Median :   27.0   Median : 2.1000  
##  Mean   :  6.225       Mean   :0.745    Mean   : 1103.8   Mean   : 9.6681  
##  3rd Qu.:  6.250       3rd Qu.:0.900    3rd Qu.:  228.2   3rd Qu.:11.2050  
##  Max.   :112.000       Max.   :1.000    Max.   :17502.0   Max.   :63.0000  
##   ProductDepth      ProductWidth    ProductHeight     ProfitMargin   
##  Min.   :  0.000   Min.   : 0.000   Min.   : 0.000   Min.   :0.0500  
##  1st Qu.:  4.775   1st Qu.: 1.750   1st Qu.: 0.400   1st Qu.:0.0500  
##  Median :  7.950   Median : 6.800   Median : 3.950   Median :0.1200  
##  Mean   : 14.425   Mean   : 7.819   Mean   : 6.259   Mean   :0.1545  
##  3rd Qu.: 15.025   3rd Qu.:11.275   3rd Qu.:10.300   3rd Qu.:0.2000  
##  Max.   :300.000   Max.   :31.750   Max.   :25.800   Max.   :0.4000  
##      Volume     
##  Min.   :    0  
##  1st Qu.:   40  
##  Median :  200  
##  Mean   :  705  
##  3rd Qu.: 1226  
##  Max.   :11204

Now that we have a complete DF and another with missing values, we’ll be using both of them for further analysis in order to see which one performs best for our purposes.

newDataFrame <- dummyVars(" ~ .", data = df)

readyData <- data.frame(predict(newDataFrame, newdata = df))
df_c <- cbind(df_c, ProductType = df$ProductType)

newDataFrame2 <- dummyVars(" ~ .", data = df_c)

completeData <- data.frame(predict(newDataFrame2, newdata = df_c))

For our first data frame we’ll remove the features that have NaN values

readyData$BestSellersRank <- NULL
corrData <- cor(readyData)

col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))

corrplot(corrData, method="color", col=col(200), type = "upper", addCoef.col = "black",tl.cex = 0.3, tl.col = "black", tl.srt=90,number.cex = 0.3, diag=FALSE)

corrData1 <- cor(completeData)

corrplot(corrData1, method="color", col=col(200), type = "upper", addCoef.col = "black",tl.cex = 0.3, tl.col = "black", tl.srt=90,number.cex = 0.3, diag=FALSE)

By looking the last correlation plot we can confirm that BestSellerRank feature is actually non correlated to Volume, so we’ll be sticking to our first data set.

Also, let’s get a better view of the correlations by descending order.

c <- cor(readyData)

c[upper.tri(c, diag=TRUE)] <- NA


m <- melt(c)
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
m <- m[order(- abs(m$value)), ]

dfOut <- na.omit(m)

head(dfOut,20)
##                        X1                          X2      value
## 420                Volume               x5StarReviews  1.0000000
## 495         x1StarReviews               x2StarReviews  0.9519130
## 437         x3StarReviews               x4StarReviews  0.9372142
## 525 NegativeServiceReview               x1StarReviews  0.8847283
## 408         x4StarReviews               x5StarReviews  0.8790064
## 448                Volume               x4StarReviews  0.8790064
## 497 NegativeServiceReview               x2StarReviews  0.8647548
## 466         x2StarReviews               x3StarReviews  0.8614800
## 83           ProfitMargin ProductTypeExtendedWarranty  0.8022672
## 409         x3StarReviews               x5StarReviews  0.7633732
## 476                Volume               x3StarReviews  0.7633732
## 219        ShippingWeight          ProductTypePrinter  0.7576764
## 642         ProductHeight              ShippingWeight  0.7003111
## 641          ProductWidth              ShippingWeight  0.6924735
## 469 NegativeServiceReview               x3StarReviews  0.6840966
## 467         x1StarReviews               x3StarReviews  0.6792762
## 438         x2StarReviews               x4StarReviews  0.6790056
## 76  PositiveServiceReview ProductTypeExtendedWarranty  0.6271095
## 27           ProfitMargin      ProductTypeAccessories -0.6269352
## 412 PositiveServiceReview               x5StarReviews  0.6222602

It looks like Volume and x5StarReviews are totally correlated, also, we can see that we have a high correlation between our StarReviews, so to eliminate the possibility of collinearity, we’ll be removing some of them for further analysis and predictions.

readyData$x5StarReviews <- NULL
readyData$x3StarReviews <- NULL
readyData$x1StarReviews <- NULL
readyData$NegativeServiceReview <- NULL

We eliminated 5 variables from our dataset given that they were highly correlated between each other and could affect our model.

Let’s see which features are significant in order to predict the Volume using non-parametric tests (because earlier we saw that our variables doesn’t follow a normal distribution)

Kruskal-Wallis Test

for (i in readyData[,0:24]){
  print(kruskal.test(i ~ Volume, data = readyData)$p.value)
}
## [1] 0.06629937
## [1] 0.08341084
## [1] 0.07490137
## [1] 0.01179999
## [1] 0.7027777
## [1] 0.5134315
## [1] 0.9806215
## [1] 0.6920448
## [1] 0.7027777
## [1] 0.07330937
## [1] 0.04310371
## [1] 0.281203
## [1] 0.4431886
## [1] 0.2801196
## [1] 0.01650155
## [1] 0.03205695
## [1] 0.01719002
## [1] 0.1522634
## [1] 0.3610238
## [1] 0.1615929
## [1] 0.2268103
## [1] 0.3613066
## [1] 0.0383235
## [1] 0.01179999
results <- list()
for(i in names(readyData[,0:23])){  
  results[[i]] <- kruskal.test(formula(paste(i, "~ Volume")), data = readyData)$p.value
}
p_values <- data.frame(results)
p_values <- t(p_values)
p_values <- cbind(Feature_Name = rownames(p_values), p_values)
rownames(p_values) <- NULL
p_values <- data.frame(p_values)
p_values$V2 <- as.numeric(p_values$V2)
p <- ggplot(p_values, aes(x = reorder(Feature_Name, -V2), y=V2)) +
  geom_segment(aes(x=reorder(Feature_Name, -V2), xend=Feature_Name, y=0, yend=as.numeric(V2)), color="skyblue") +
  xlab("Feature") + ylab("P-Value") +
  ggtitle("Feature Selection", ) +
  geom_point(color="blue", size=2.5, alpha=0.6) +
  geom_hline(yintercept = 0.00,  color = "green", size=0.4) +
  geom_hline(yintercept = 0.05,  color = "green", size=0.4) +
  geom_rect(aes(xmin=0, xmax=Inf,ymin=0,ymax=0.05), fill='green', alpha= 0.007)+
  theme_light() +
  coord_flip()

p

Based on our results, we could identify 6 features that can help us predict the Volume.

set.seed(20)

library(mlbench)

model <- train(Volume~., data=readyData, method="lm")

importance <- varImp(model, scale = FALSE)

print(importance)
## lm variable importance
## 
##   only 20 most important variables shown (out of 22)
## 
##                             Overall
## x4StarReviews                9.3702
## ProductWidth                 3.1211
## PositiveServiceReview        2.8306
## ProductTypeGameConsole       1.9849
## x2StarReviews                1.7966
## ProductDepth                 1.4185
## ProductTypeNetbook           1.2603
## ShippingWeight               1.2308
## ProductTypeSmartphone        0.9565
## Recommendproduct             0.9510
## ProductTypeExtendedWarranty  0.8568
## ProfitMargin                 0.7860
## ProductTypeSoftware          0.7766
## ProductTypePrinter           0.6589
## ProductTypeDisplay           0.6361
## ProductTypeAccessories       0.5929
## Price                        0.4780
## ProductHeight                0.4427
## ProductTypePrinterSupplies   0.3228
## ProductTypePC                0.2588
plot(importance)

Our previous analysis are clearly showing us that the 4 star review is our most important feature in order to predict the sales volume, still we got other variables that are significantly important to our future models that will help us predict our dependent variable. Bases on our Kruskal-Wallis test, we’ll be choosing all the variables with a p-value of 0.05<=.

df_final <- readyData %>% select(Volume, ProductTypeGameConsole, x4StarReviews,PositiveServiceReview, x2StarReviews, ProfitMargin, ProductTypeSoftware, ProductWidth)
p2 <- plot_ly(df_final, x = ~x4StarReviews, y = ~Volume, type = "scatter", mode = 'markers')
p4 <- plot_ly(df_final, x = ~x2StarReviews, y = ~Volume, type = "scatter", mode = 'markers')

fig9 <- subplot(p2,p4, nrows = 2) %>% 
  layout(title = list(text = "Star Reviews"),
         plot_bgcolor='#e5ecf6', 
         xaxis = list( 
           zerolinecolor = '#ffff', 
           zerolinewidth = 2, 
           gridcolor = 'ffff'), 
         yaxis = list( 
           zerolinecolor = '#ffff', 
           zerolinewidth = 2, 
           gridcolor = 'ffff')) 

fig9

Kind of a clear story, the most star reviews the higuer the sales volume.

t1 <- plot_ly(df_final, x = ~PositiveServiceReview, y = ~Volume, type = "scatter", mode = 'markers') %>%
  layout(title = "Positive Reviews")


t1

In this case we can also identify the difference in the sales volume given a positive or negative review.

fig10 <- plot_ly(df_final, x = as.character(df_final$ProfitMargin), y = df_final$Volume, color = as.character(df_final$ProfitMargin), type = "box") %>% 
         layout(title = "Profit Margin" , boxmode = "group", 
         xaxis = list(title='Profit Margin'), 
         yaxis = list(title='Volume'))
fig10
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'

We can identify some changes given a change of margin profit in the volume of our sales, it looks that between 5% and 10% it remains constant and has a spike in our products that have an 18% profit margin

fig10 <- plot_ly(df_final, x = as.character(df_final$ProductTypeGameConsole), y = df_final$Volume, color = as.character(df_final$ProductTypeGameConsole), type = "box") %>% 
         layout(title = "Game Console" , boxmode = "group", 
         xaxis = list(title='Game Console'), 
         yaxis = list(title='Volume'))
fig10
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'

Our box-plot clearly shows the difference and impact that our variables GameConsole has on our sales volume.

fig11 <- plot_ly(df_final, x = as.character(df_final$ProductTypeSoftware), y = df_final$Volume, color = as.character(df_final$ProductTypeSoftware), type = "box") %>% 
         layout(title = "Software", boxmode = "group", 
         xaxis = list(title='Software'), 
         yaxis = list(title='Volume'))
 
fig11
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'

For our software variable is a different story, we can’t clearly identify the difference by the plot, so for this particular analysis we’ll be taking the hand of our importance plot and not take it into account.

t1 <- ggplot(df_final, aes(x=ProductWidth, y=Volume)) +
    ggtitle("Product Width") +
    geom_point() +
    geom_smooth(method=lm, se=FALSE, col="red")


ggplotly(t1)
## `geom_smooth()` using formula 'y ~ x'

It looks to be having a negative trend, as the correlation told us but not much of a trend that could bring valuable information, so we’ll not be including it in our model.

df_final$ProductTypeSoftware <- NULL
df_final$ProductWidth <- NULL
newDataFrame <- dummyVars(" ~ .", data = NewProducts)

NewProducts <- data.frame(predict(newDataFrame, newdata = NewProducts))

Models

GLM (Generalized Linear Model)

set.seed(998)

# define an 75%/25% train/test split of the dataset
inTraining <- createDataPartition(df_final$Volume, p = .75, list = FALSE)
training <- df_final[inTraining,]
testing <- df_final[-inTraining,]

#train Random Forest Regression model with a tuneLenght = 1 (trains with 1 mtry value for GBM)
rfFit1 <- train(Volume~., data = training, method = "glm", trControl = trainControl(method = "cv"))


#training results
rfFit1
## Generalized Linear Model 
## 
## 61 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 55, 54, 55, 56, 54, 54, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   989.7352  0.8817401  576.4714
rfFit1$results
##   parameter     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      none 989.7352 0.8817401 576.4714 827.1681  0.1744147 362.1388
pred1 <- predict(rfFit1, NewProducts)
pred10 <- predict(rfFit1, testing)

cor(pred10, testing$Volume)^2
## [1] 0.7012356

It looks like our model is overfitting.

SVM

rfFit2 <- train(Volume~., data = training, method = "svmLinear", tuneLength = 10, trControl = trainControl(method = "cv"))


#training results
rfFit2
## Support Vector Machines with Linear Kernel 
## 
## 61 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 54, 56, 55, 56, 55, 55, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   763.1569  0.9192629  405.4651
## 
## Tuning parameter 'C' was held constant at a value of 1
rfFit2$results
##   C     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1 1 763.1569 0.9192629 405.4651 797.3195 0.09245186 340.2871
pred2 <- predict(rfFit2, NewProducts)

pred2
##          1          2          3          4          5          6          7 
##  199.14933   97.95329  218.86386   55.27040  -47.10844  179.11872 1177.11746 
##          8          9         10         11         12         13         14 
##  227.00628  128.83881  774.51815 4078.97933  319.19206  319.15967  114.44921 
##         15         16         17         18         19         20         21 
##  238.11418 5133.98271  135.52003  149.28484  102.60240  154.95548 -639.35541 
##         22         23         24 
## -118.57054 -217.56595 6493.17030
pred10 <- predict(rfFit2, testing)

cor(pred10, testing$Volume)^2
## [1] 0.6520054

Looks like we have some negative values, which can’t happen and also, the model is overfitting.

Random Forest

rfFit3 <- train(Volume~., data = training, method = "rf", tuneLength = 10, trControl = trainControl(method = "cv"))
## note: only 4 unique complexity parameters in default grid. Truncating the grid to 4 .
#training results
rfFit3
## Random Forest 
## 
## 61 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 56, 56, 56, 56, 56, 54, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     828.4399  0.9102431  372.3801
##   3     862.2728  0.9175812  381.0496
##   4     834.7555  0.9216095  369.9122
##   5     842.6461  0.9235479  372.4024
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
rfFit3$results
##   mtry     RMSE  Rsquared      MAE    RMSESD RsquaredSD    MAESD
## 1    2 828.4399 0.9102431 372.3801  961.2026 0.11278575 389.0679
## 2    3 862.2728 0.9175812 381.0496 1003.3125 0.10445390 408.4448
## 3    4 834.7555 0.9216095 369.9122  986.4163 0.10350376 408.2336
## 4    5 842.6461 0.9235479 372.4024 1015.9756 0.09997813 419.3598
pred3 <- predict(rfFit3, NewProducts)

pred3
##          1          2          3          4          5          6          7 
##  442.26410  221.44560  196.32549   23.12414   69.12992   46.22737 1116.67053 
##          8          9         10         11         12         13         14 
##  286.75360   25.74032 1066.34977 5969.72360  409.06396  549.90690  113.50267 
##         15         16         17         18         19         20         21 
##  279.28493 2255.54564   22.20990   25.09586   83.76504  114.56067  139.54222 
##         22         23         24 
##   14.67259   69.76966 3537.82431
pred10 <- predict(rfFit3, testing)

cor(pred10, testing$Volume)^2
## [1] 0.8652733

Our model is giving us pretty decent results, even though that it looks that it might be overfitting, we don’t believe this could be a big problem for our model.

Gradient Boosting

#training results
head(rfFit4$results,10)
##    shrinkage interaction.depth n.minobsinnode n.trees     RMSE  Rsquared
## 1        0.1                 1             10      50 945.9512 0.7218259
## 11       0.1                 2             10      50 949.6157 0.7358646
## 21       0.1                 3             10      50 978.7432 0.7315849
## 31       0.1                 4             10      50 990.6231 0.7376646
## 41       0.1                 5             10      50 932.9612 0.7480649
## 51       0.1                 6             10      50 951.8864 0.7592244
## 61       0.1                 7             10      50 921.7977 0.7373406
## 71       0.1                 8             10      50 942.2771 0.7483279
## 81       0.1                 9             10      50 952.6352 0.7344870
## 91       0.1                10             10      50 981.4375 0.7246907
##         MAE   RMSESD RsquaredSD    MAESD
## 1  550.2855 1074.704  0.2664260 428.4453
## 11 527.5128 1083.015  0.2888108 446.8419
## 21 531.2093 1083.791  0.2669793 423.5485
## 31 566.8528 1057.379  0.2774075 440.9927
## 41 529.8263 1078.462  0.2567087 448.1568
## 51 528.2007 1079.125  0.2408508 402.7447
## 61 508.1109 1096.790  0.2599985 450.5425
## 71 543.5619 1073.001  0.2536127 438.4721
## 81 535.0432 1083.527  0.2696207 428.7920
## 91 555.7202 1065.828  0.2490614 421.6123
pred4 <- predict(rfFit4, NewProducts)

pred4
##  [1]  440.042120  494.868771   -4.982187  -10.157833   67.055072  -93.583244
##  [7] 1874.913277  224.328088  -81.227054 1778.334768 2341.587789  180.047661
## [13]  553.688351  -38.750876  224.328088 2030.073331  180.973468  180.973468
## [19] -126.554039  -31.038513   48.786293   79.830641   79.830641 2341.587789
pred10 <- predict(rfFit4, testing)

cor(pred10, testing$Volume)^2
## [1] 0.6894207

Not the model we are looking for, negative values..

Linear Regression Model

rfFit5 <- train(Volume~., data = training, method = "lm", tuneLength = 10, trControl = trainControl(method = "cv"))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
#training results
rfFit5
## Linear Regression 
## 
## 61 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 54, 55, 57, 53, 55, 55, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   964.7516  0.8032241  575.9304
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
rfFit5$results
##   intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 964.7516 0.8032241 575.9304 866.8826  0.1860898 411.7883
pred5 <- predict(rfFit5, NewProducts)

pred5
##            1            2            3            4            5            6 
##   -18.427438  -112.067371   171.105871  -111.744637  -335.672332   125.275315 
##            7            8            9           10           11           12 
##  1477.584915   118.384972    44.360524   929.697747  5214.394008   266.309738 
##           13           14           15           16           17           18 
##   224.787470   -33.299396   138.772203  4756.698595    92.598201   113.329780 
##           19           20           21           22           23           24 
##    -7.245702   -14.956643 -1762.500312  -490.669438  -724.191266  6520.838798
pred10 <- predict(rfFit5, testing)

cor(pred10, testing$Volume)^2
## [1] 0.7012356

Looks like our LM model is overfitting and also predicting some negative values.

Polynomial Kernel Regularized Least Squares

rfFit6 <- train(Volume~., data = training, method = "krlsPoly", tuneLength = 10,
                 trControl = trainControl(method = "cv"))


#training results
rfFit6
## Polynomial Kernel Regularized Least Squares 
## 
## 61 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 55, 55, 54, 56, 56, 54, ... 
## Resampling results across tuning parameters:
## 
##   degree  RMSE       Rsquared   MAE      
##   1       10216.387  0.5835669  8962.3431
##   2        1000.329  0.6859279   728.8218
##   3        1027.118  0.6825610   762.2832
## 
## Tuning parameter 'lambda' was held constant at a value of NA
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were lambda = NA and degree = 2.
rfFit6$results
##   lambda degree      RMSE  Rsquared       MAE   RMSESD RsquaredSD     MAESD
## 1     NA      1 10216.387 0.5835669 8962.3431 6135.611  0.3065725 5886.5167
## 2     NA      2  1000.329 0.6859279  728.8218 1140.508  0.2588386  467.1964
## 3     NA      3  1027.118 0.6825610  762.2832 1132.043  0.2435466  466.1213
pred6 <- predict(rfFit6, NewProducts)

pred6
##  [1] 662.4859 646.9589 649.7113 640.7030 650.0035 652.4429 712.0468 653.7393
##  [9] 649.1024 680.1917 768.4608 654.2306 661.6993 642.0568 653.1575 769.5608
## [17] 660.4541 660.7129 647.0344 646.6327 769.2336 673.1936 713.5219 770.3975
pred10 <- predict(rfFit6, testing)

cor(pred10, testing$Volume)^2
## [1] 0.7399693

It looks like our model might be underfitting, still, far away from the results we obtained from our RF model.

Looks like the Random Forest regression Model is our go to model to try and predict the sales volume in this particular scenario.

Lets use those predictions and add them to our dataset.

output <- NewProducts

output$predictions <- pred3

write.csv(output, file="C2.T3output.csv", row.names = TRUE)