df <- read.csv(("C:/Users/andre/OneDrive/Andres Marquez/UBIQUM/Project 3/Part 2/existingproducts.csv"))
NewProducts <- read.csv(("C:/Users/andre/OneDrive/Andres Marquez/UBIQUM/Project 3/Part 2/newproducts.csv"))
All around summary
skim(df)
| Name | df |
| Number of rows | 80 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 17 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ProductType | 0 | 1 | 2 | 16 | 0 | 12 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ProductNum | 0 | 1.00 | 142.55 | 26.73 | 101.00 | 120.75 | 140.50 | 160.25 | 200.00 | ▇▇▇▃▃ |
| Price | 0 | 1.00 | 247.25 | 339.57 | 3.60 | 52.66 | 132.72 | 352.49 | 2249.99 | ▇▁▁▁▁ |
| x5StarReviews | 0 | 1.00 | 176.25 | 379.19 | 0.00 | 10.00 | 50.00 | 306.50 | 2801.00 | ▇▁▁▁▁ |
| x4StarReviews | 0 | 1.00 | 40.20 | 70.74 | 0.00 | 2.75 | 22.00 | 33.00 | 431.00 | ▇▁▁▁▁ |
| x3StarReviews | 0 | 1.00 | 14.79 | 30.96 | 0.00 | 2.00 | 7.00 | 11.25 | 162.00 | ▇▁▁▁▁ |
| x2StarReviews | 0 | 1.00 | 13.79 | 45.92 | 0.00 | 1.00 | 3.00 | 7.00 | 370.00 | ▇▁▁▁▁ |
| x1StarReviews | 0 | 1.00 | 37.67 | 186.72 | 0.00 | 2.00 | 8.50 | 15.25 | 1654.00 | ▇▁▁▁▁ |
| PositiveServiceReview | 0 | 1.00 | 51.75 | 104.47 | 0.00 | 2.00 | 5.50 | 42.00 | 536.00 | ▇▁▁▁▁ |
| NegativeServiceReview | 0 | 1.00 | 6.22 | 14.30 | 0.00 | 1.00 | 3.00 | 6.25 | 112.00 | ▇▁▁▁▁ |
| Recommendproduct | 0 | 1.00 | 0.74 | 0.20 | 0.10 | 0.70 | 0.80 | 0.90 | 1.00 | ▁▂▂▇▇ |
| BestSellersRank | 15 | 0.81 | 1126.31 | 3261.13 | 1.00 | 7.00 | 27.00 | 281.00 | 17502.00 | ▇▁▁▁▁ |
| ShippingWeight | 0 | 1.00 | 9.67 | 15.21 | 0.01 | 0.51 | 2.10 | 11.20 | 63.00 | ▇▁▁▁▁ |
| ProductDepth | 0 | 1.00 | 14.43 | 34.93 | 0.00 | 4.77 | 7.95 | 15.03 | 300.00 | ▇▁▁▁▁ |
| ProductWidth | 0 | 1.00 | 7.82 | 6.81 | 0.00 | 1.75 | 6.80 | 11.27 | 31.75 | ▇▆▃▁▁ |
| ProductHeight | 0 | 1.00 | 6.26 | 6.89 | 0.00 | 0.40 | 3.95 | 10.30 | 25.80 | ▇▃▂▁▁ |
| ProfitMargin | 0 | 1.00 | 0.15 | 0.12 | 0.05 | 0.05 | 0.12 | 0.20 | 0.40 | ▇▃▁▁▂ |
| Volume | 0 | 1.00 | 705.00 | 1516.77 | 0.00 | 40.00 | 200.00 | 1226.00 | 11204.00 | ▇▁▁▁▁ |
sum(duplicated(df))
## [1] 0
Handling Missing Values using MICE pakcage
PMM (Predictive Mean Matching)
df_c <- subset(df, select = -c(ProductType))
summary(df_c)
## ProductNum Price x5StarReviews x4StarReviews
## Min. :101.0 Min. : 3.60 Min. : 0.0 Min. : 0.00
## 1st Qu.:120.8 1st Qu.: 52.66 1st Qu.: 10.0 1st Qu.: 2.75
## Median :140.5 Median : 132.72 Median : 50.0 Median : 22.00
## Mean :142.6 Mean : 247.25 Mean : 176.2 Mean : 40.20
## 3rd Qu.:160.2 3rd Qu.: 352.49 3rd Qu.: 306.5 3rd Qu.: 33.00
## Max. :200.0 Max. :2249.99 Max. :2801.0 Max. :431.00
##
## x3StarReviews x2StarReviews x1StarReviews PositiveServiceReview
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.: 2.00 1st Qu.: 2.00
## Median : 7.00 Median : 3.00 Median : 8.50 Median : 5.50
## Mean : 14.79 Mean : 13.79 Mean : 37.67 Mean : 51.75
## 3rd Qu.: 11.25 3rd Qu.: 7.00 3rd Qu.: 15.25 3rd Qu.: 42.00
## Max. :162.00 Max. :370.00 Max. :1654.00 Max. :536.00
##
## NegativeServiceReview Recommendproduct BestSellersRank ShippingWeight
## Min. : 0.000 Min. :0.100 Min. : 1 Min. : 0.0100
## 1st Qu.: 1.000 1st Qu.:0.700 1st Qu.: 7 1st Qu.: 0.5125
## Median : 3.000 Median :0.800 Median : 27 Median : 2.1000
## Mean : 6.225 Mean :0.745 Mean : 1126 Mean : 9.6681
## 3rd Qu.: 6.250 3rd Qu.:0.900 3rd Qu.: 281 3rd Qu.:11.2050
## Max. :112.000 Max. :1.000 Max. :17502 Max. :63.0000
## NA's :15
## ProductDepth ProductWidth ProductHeight ProfitMargin
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.0500
## 1st Qu.: 4.775 1st Qu.: 1.750 1st Qu.: 0.400 1st Qu.:0.0500
## Median : 7.950 Median : 6.800 Median : 3.950 Median :0.1200
## Mean : 14.425 Mean : 7.819 Mean : 6.259 Mean :0.1545
## 3rd Qu.: 15.025 3rd Qu.:11.275 3rd Qu.:10.300 3rd Qu.:0.2000
## Max. :300.000 Max. :31.750 Max. :25.800 Max. :0.4000
##
## Volume
## Min. : 0
## 1st Qu.: 40
## Median : 200
## Mean : 705
## 3rd Qu.: 1226
## Max. :11204
##
summary(imputed_data)
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## ProductNum Price x5StarReviews
## "" "" ""
## x4StarReviews x3StarReviews x2StarReviews
## "" "" ""
## x1StarReviews PositiveServiceReview NegativeServiceReview
## "" "" ""
## Recommendproduct BestSellersRank ShippingWeight
## "" "pmm" ""
## ProductDepth ProductWidth ProductHeight
## "" "" ""
## ProfitMargin Volume
## "" ""
## PredictorMatrix:
## ProductNum Price x5StarReviews x4StarReviews x3StarReviews
## ProductNum 0 1 1 1 1
## Price 1 0 1 1 1
## x5StarReviews 1 1 0 1 1
## x4StarReviews 1 1 1 0 1
## x3StarReviews 1 1 1 1 0
## x2StarReviews 1 1 1 1 1
## x2StarReviews x1StarReviews PositiveServiceReview
## ProductNum 1 1 1
## Price 1 1 1
## x5StarReviews 1 1 1
## x4StarReviews 1 1 1
## x3StarReviews 1 1 1
## x2StarReviews 0 1 1
## NegativeServiceReview Recommendproduct BestSellersRank
## ProductNum 1 1 1
## Price 1 1 1
## x5StarReviews 1 1 1
## x4StarReviews 1 1 1
## x3StarReviews 1 1 1
## x2StarReviews 1 1 1
## ShippingWeight ProductDepth ProductWidth ProductHeight
## ProductNum 1 1 1 1
## Price 1 1 1 1
## x5StarReviews 1 1 1 1
## x4StarReviews 1 1 1 1
## x3StarReviews 1 1 1 1
## x2StarReviews 1 1 1 1
## ProfitMargin Volume
## ProductNum 1 0
## Price 1 0
## x5StarReviews 1 0
## x4StarReviews 1 0
## x3StarReviews 1 0
## x2StarReviews 1 0
## Number of logged events: 1
## it im dep meth out
## 1 0 0 collinear Volume
Let’s now check the five columns with the different imputed values
imputed_data$imp$BestSellersRank
## 1 2 3 4 5
## 7 76 48 22 12076 7
## 9 17 17 7 720 7
## 11 60 110 927 4806 100
## 13 2 7 27 16 29
## 15 3 5742 17 11 7
## 19 14086 14086 7 720 18
## 21 11 134 11 17 7
## 30 544 7 2 1 22
## 31 109 352 2 3 22
## 42 100 29 4 110 720
## 63 29 7 11 27 1
## 64 27 661 69 281 927
## 65 27 6 7 7 7
## 66 3 100 10 2 7
## 69 2 27 3 16 352
Let’s now add the values to our data frame, in this case we’ll be using the first column
df_c <- complete(imputed_data,1)
summary(df_c)
## ProductNum Price x5StarReviews x4StarReviews
## Min. :101.0 Min. : 3.60 Min. : 0.0 Min. : 0.00
## 1st Qu.:120.8 1st Qu.: 52.66 1st Qu.: 10.0 1st Qu.: 2.75
## Median :140.5 Median : 132.72 Median : 50.0 Median : 22.00
## Mean :142.6 Mean : 247.25 Mean : 176.2 Mean : 40.20
## 3rd Qu.:160.2 3rd Qu.: 352.49 3rd Qu.: 306.5 3rd Qu.: 33.00
## Max. :200.0 Max. :2249.99 Max. :2801.0 Max. :431.00
## x3StarReviews x2StarReviews x1StarReviews PositiveServiceReview
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.: 2.00 1st Qu.: 2.00
## Median : 7.00 Median : 3.00 Median : 8.50 Median : 5.50
## Mean : 14.79 Mean : 13.79 Mean : 37.67 Mean : 51.75
## 3rd Qu.: 11.25 3rd Qu.: 7.00 3rd Qu.: 15.25 3rd Qu.: 42.00
## Max. :162.00 Max. :370.00 Max. :1654.00 Max. :536.00
## NegativeServiceReview Recommendproduct BestSellersRank ShippingWeight
## Min. : 0.000 Min. :0.100 Min. : 1.0 Min. : 0.0100
## 1st Qu.: 1.000 1st Qu.:0.700 1st Qu.: 7.0 1st Qu.: 0.5125
## Median : 3.000 Median :0.800 Median : 27.0 Median : 2.1000
## Mean : 6.225 Mean :0.745 Mean : 1103.8 Mean : 9.6681
## 3rd Qu.: 6.250 3rd Qu.:0.900 3rd Qu.: 228.2 3rd Qu.:11.2050
## Max. :112.000 Max. :1.000 Max. :17502.0 Max. :63.0000
## ProductDepth ProductWidth ProductHeight ProfitMargin
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.0500
## 1st Qu.: 4.775 1st Qu.: 1.750 1st Qu.: 0.400 1st Qu.:0.0500
## Median : 7.950 Median : 6.800 Median : 3.950 Median :0.1200
## Mean : 14.425 Mean : 7.819 Mean : 6.259 Mean :0.1545
## 3rd Qu.: 15.025 3rd Qu.:11.275 3rd Qu.:10.300 3rd Qu.:0.2000
## Max. :300.000 Max. :31.750 Max. :25.800 Max. :0.4000
## Volume
## Min. : 0
## 1st Qu.: 40
## Median : 200
## Mean : 705
## 3rd Qu.: 1226
## Max. :11204
Now that we have a complete DF and another with missing values, we’ll be using both of them for further analysis in order to see which one performs best for our purposes.
newDataFrame <- dummyVars(" ~ .", data = df)
readyData <- data.frame(predict(newDataFrame, newdata = df))
df_c <- cbind(df_c, ProductType = df$ProductType)
newDataFrame2 <- dummyVars(" ~ .", data = df_c)
completeData <- data.frame(predict(newDataFrame2, newdata = df_c))
For our first data frame we’ll remove the features that have NaN values
readyData$BestSellersRank <- NULL
corrData <- cor(readyData)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(corrData, method="color", col=col(200), type = "upper", addCoef.col = "black",tl.cex = 0.3, tl.col = "black", tl.srt=90,number.cex = 0.3, diag=FALSE)
corrData1 <- cor(completeData)
corrplot(corrData1, method="color", col=col(200), type = "upper", addCoef.col = "black",tl.cex = 0.3, tl.col = "black", tl.srt=90,number.cex = 0.3, diag=FALSE)
By looking the last correlation plot we can confirm that BestSellerRank feature is actually non correlated to Volume, so we’ll be sticking to our first data set.
Also, let’s get a better view of the correlations by descending order.
c <- cor(readyData)
c[upper.tri(c, diag=TRUE)] <- NA
m <- melt(c)
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
m <- m[order(- abs(m$value)), ]
dfOut <- na.omit(m)
head(dfOut,20)
## X1 X2 value
## 420 Volume x5StarReviews 1.0000000
## 495 x1StarReviews x2StarReviews 0.9519130
## 437 x3StarReviews x4StarReviews 0.9372142
## 525 NegativeServiceReview x1StarReviews 0.8847283
## 408 x4StarReviews x5StarReviews 0.8790064
## 448 Volume x4StarReviews 0.8790064
## 497 NegativeServiceReview x2StarReviews 0.8647548
## 466 x2StarReviews x3StarReviews 0.8614800
## 83 ProfitMargin ProductTypeExtendedWarranty 0.8022672
## 409 x3StarReviews x5StarReviews 0.7633732
## 476 Volume x3StarReviews 0.7633732
## 219 ShippingWeight ProductTypePrinter 0.7576764
## 642 ProductHeight ShippingWeight 0.7003111
## 641 ProductWidth ShippingWeight 0.6924735
## 469 NegativeServiceReview x3StarReviews 0.6840966
## 467 x1StarReviews x3StarReviews 0.6792762
## 438 x2StarReviews x4StarReviews 0.6790056
## 76 PositiveServiceReview ProductTypeExtendedWarranty 0.6271095
## 27 ProfitMargin ProductTypeAccessories -0.6269352
## 412 PositiveServiceReview x5StarReviews 0.6222602
It looks like Volume and x5StarReviews are totally correlated, also, we can see that we have a high correlation between our StarReviews, so to eliminate the possibility of collinearity, we’ll be removing some of them for further analysis and predictions.
readyData$x5StarReviews <- NULL
readyData$x3StarReviews <- NULL
readyData$x1StarReviews <- NULL
readyData$NegativeServiceReview <- NULL
We eliminated 5 variables from our dataset given that they were highly correlated between each other and could affect our model.
Let’s see which features are significant in order to predict the Volume using non-parametric tests (because earlier we saw that our variables doesn’t follow a normal distribution)
Kruskal-Wallis Test
for (i in readyData[,0:24]){
print(kruskal.test(i ~ Volume, data = readyData)$p.value)
}
## [1] 0.06629937
## [1] 0.08341084
## [1] 0.07490137
## [1] 0.01179999
## [1] 0.7027777
## [1] 0.5134315
## [1] 0.9806215
## [1] 0.6920448
## [1] 0.7027777
## [1] 0.07330937
## [1] 0.04310371
## [1] 0.281203
## [1] 0.4431886
## [1] 0.2801196
## [1] 0.01650155
## [1] 0.03205695
## [1] 0.01719002
## [1] 0.1522634
## [1] 0.3610238
## [1] 0.1615929
## [1] 0.2268103
## [1] 0.3613066
## [1] 0.0383235
## [1] 0.01179999
results <- list()
for(i in names(readyData[,0:23])){
results[[i]] <- kruskal.test(formula(paste(i, "~ Volume")), data = readyData)$p.value
}
p_values <- data.frame(results)
p_values <- t(p_values)
p_values <- cbind(Feature_Name = rownames(p_values), p_values)
rownames(p_values) <- NULL
p_values <- data.frame(p_values)
p_values$V2 <- as.numeric(p_values$V2)
p <- ggplot(p_values, aes(x = reorder(Feature_Name, -V2), y=V2)) +
geom_segment(aes(x=reorder(Feature_Name, -V2), xend=Feature_Name, y=0, yend=as.numeric(V2)), color="skyblue") +
xlab("Feature") + ylab("P-Value") +
ggtitle("Feature Selection", ) +
geom_point(color="blue", size=2.5, alpha=0.6) +
geom_hline(yintercept = 0.00, color = "green", size=0.4) +
geom_hline(yintercept = 0.05, color = "green", size=0.4) +
geom_rect(aes(xmin=0, xmax=Inf,ymin=0,ymax=0.05), fill='green', alpha= 0.007)+
theme_light() +
coord_flip()
p
Based on our results, we could identify 6 features that can help us predict the Volume.
set.seed(20)
library(mlbench)
model <- train(Volume~., data=readyData, method="lm")
importance <- varImp(model, scale = FALSE)
print(importance)
## lm variable importance
##
## only 20 most important variables shown (out of 22)
##
## Overall
## x4StarReviews 9.3702
## ProductWidth 3.1211
## PositiveServiceReview 2.8306
## ProductTypeGameConsole 1.9849
## x2StarReviews 1.7966
## ProductDepth 1.4185
## ProductTypeNetbook 1.2603
## ShippingWeight 1.2308
## ProductTypeSmartphone 0.9565
## Recommendproduct 0.9510
## ProductTypeExtendedWarranty 0.8568
## ProfitMargin 0.7860
## ProductTypeSoftware 0.7766
## ProductTypePrinter 0.6589
## ProductTypeDisplay 0.6361
## ProductTypeAccessories 0.5929
## Price 0.4780
## ProductHeight 0.4427
## ProductTypePrinterSupplies 0.3228
## ProductTypePC 0.2588
plot(importance)
Our previous analysis are clearly showing us that the 4 star review is our most important feature in order to predict the sales volume, still we got other variables that are significantly important to our future models that will help us predict our dependent variable. Bases on our Kruskal-Wallis test, we’ll be choosing all the variables with a p-value of 0.05<=.
df_final <- readyData %>% select(Volume, ProductTypeGameConsole, x4StarReviews,PositiveServiceReview, x2StarReviews, ProfitMargin, ProductTypeSoftware, ProductWidth)
p2 <- plot_ly(df_final, x = ~x4StarReviews, y = ~Volume, type = "scatter", mode = 'markers')
p4 <- plot_ly(df_final, x = ~x2StarReviews, y = ~Volume, type = "scatter", mode = 'markers')
fig9 <- subplot(p2,p4, nrows = 2) %>%
layout(title = list(text = "Star Reviews"),
plot_bgcolor='#e5ecf6',
xaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
yaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'))
fig9
Kind of a clear story, the most star reviews the higuer the sales volume.
t1 <- plot_ly(df_final, x = ~PositiveServiceReview, y = ~Volume, type = "scatter", mode = 'markers') %>%
layout(title = "Positive Reviews")
t1
In this case we can also identify the difference in the sales volume given a positive or negative review.
fig10 <- plot_ly(df_final, x = as.character(df_final$ProfitMargin), y = df_final$Volume, color = as.character(df_final$ProfitMargin), type = "box") %>%
layout(title = "Profit Margin" , boxmode = "group",
xaxis = list(title='Profit Margin'),
yaxis = list(title='Volume'))
fig10
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'
We can identify some changes given a change of margin profit in the volume of our sales, it looks that between 5% and 10% it remains constant and has a spike in our products that have an 18% profit margin
fig10 <- plot_ly(df_final, x = as.character(df_final$ProductTypeGameConsole), y = df_final$Volume, color = as.character(df_final$ProductTypeGameConsole), type = "box") %>%
layout(title = "Game Console" , boxmode = "group",
xaxis = list(title='Game Console'),
yaxis = list(title='Volume'))
fig10
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'
Our box-plot clearly shows the difference and impact that our variables GameConsole has on our sales volume.
fig11 <- plot_ly(df_final, x = as.character(df_final$ProductTypeSoftware), y = df_final$Volume, color = as.character(df_final$ProductTypeSoftware), type = "box") %>%
layout(title = "Software", boxmode = "group",
xaxis = list(title='Software'),
yaxis = list(title='Volume'))
fig11
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'
For our software variable is a different story, we can’t clearly identify the difference by the plot, so for this particular analysis we’ll be taking the hand of our importance plot and not take it into account.
t1 <- ggplot(df_final, aes(x=ProductWidth, y=Volume)) +
ggtitle("Product Width") +
geom_point() +
geom_smooth(method=lm, se=FALSE, col="red")
ggplotly(t1)
## `geom_smooth()` using formula 'y ~ x'
It looks to be having a negative trend, as the correlation told us but not much of a trend that could bring valuable information, so we’ll not be including it in our model.
df_final$ProductTypeSoftware <- NULL
df_final$ProductWidth <- NULL
newDataFrame <- dummyVars(" ~ .", data = NewProducts)
NewProducts <- data.frame(predict(newDataFrame, newdata = NewProducts))
GLM (Generalized Linear Model)
set.seed(998)
# define an 75%/25% train/test split of the dataset
inTraining <- createDataPartition(df_final$Volume, p = .75, list = FALSE)
training <- df_final[inTraining,]
testing <- df_final[-inTraining,]
#train Random Forest Regression model with a tuneLenght = 1 (trains with 1 mtry value for GBM)
rfFit1 <- train(Volume~., data = training, method = "glm", trControl = trainControl(method = "cv"))
#training results
rfFit1
## Generalized Linear Model
##
## 61 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 55, 54, 55, 56, 54, 54, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 989.7352 0.8817401 576.4714
rfFit1$results
## parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 none 989.7352 0.8817401 576.4714 827.1681 0.1744147 362.1388
pred1 <- predict(rfFit1, NewProducts)
pred10 <- predict(rfFit1, testing)
cor(pred10, testing$Volume)^2
## [1] 0.7012356
It looks like our model is overfitting.
SVM
rfFit2 <- train(Volume~., data = training, method = "svmLinear", tuneLength = 10, trControl = trainControl(method = "cv"))
#training results
rfFit2
## Support Vector Machines with Linear Kernel
##
## 61 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 54, 56, 55, 56, 55, 55, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 763.1569 0.9192629 405.4651
##
## Tuning parameter 'C' was held constant at a value of 1
rfFit2$results
## C RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 763.1569 0.9192629 405.4651 797.3195 0.09245186 340.2871
pred2 <- predict(rfFit2, NewProducts)
pred2
## 1 2 3 4 5 6 7
## 199.14933 97.95329 218.86386 55.27040 -47.10844 179.11872 1177.11746
## 8 9 10 11 12 13 14
## 227.00628 128.83881 774.51815 4078.97933 319.19206 319.15967 114.44921
## 15 16 17 18 19 20 21
## 238.11418 5133.98271 135.52003 149.28484 102.60240 154.95548 -639.35541
## 22 23 24
## -118.57054 -217.56595 6493.17030
pred10 <- predict(rfFit2, testing)
cor(pred10, testing$Volume)^2
## [1] 0.6520054
Looks like we have some negative values, which can’t happen and also, the model is overfitting.
Random Forest
rfFit3 <- train(Volume~., data = training, method = "rf", tuneLength = 10, trControl = trainControl(method = "cv"))
## note: only 4 unique complexity parameters in default grid. Truncating the grid to 4 .
#training results
rfFit3
## Random Forest
##
## 61 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 56, 56, 56, 56, 56, 54, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 828.4399 0.9102431 372.3801
## 3 862.2728 0.9175812 381.0496
## 4 834.7555 0.9216095 369.9122
## 5 842.6461 0.9235479 372.4024
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
rfFit3$results
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 2 828.4399 0.9102431 372.3801 961.2026 0.11278575 389.0679
## 2 3 862.2728 0.9175812 381.0496 1003.3125 0.10445390 408.4448
## 3 4 834.7555 0.9216095 369.9122 986.4163 0.10350376 408.2336
## 4 5 842.6461 0.9235479 372.4024 1015.9756 0.09997813 419.3598
pred3 <- predict(rfFit3, NewProducts)
pred3
## 1 2 3 4 5 6 7
## 442.26410 221.44560 196.32549 23.12414 69.12992 46.22737 1116.67053
## 8 9 10 11 12 13 14
## 286.75360 25.74032 1066.34977 5969.72360 409.06396 549.90690 113.50267
## 15 16 17 18 19 20 21
## 279.28493 2255.54564 22.20990 25.09586 83.76504 114.56067 139.54222
## 22 23 24
## 14.67259 69.76966 3537.82431
pred10 <- predict(rfFit3, testing)
cor(pred10, testing$Volume)^2
## [1] 0.8652733
Our model is giving us pretty decent results, even though that it looks that it might be overfitting, we don’t believe this could be a big problem for our model.
Gradient Boosting
#training results
head(rfFit4$results,10)
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 1 0.1 1 10 50 945.9512 0.7218259
## 11 0.1 2 10 50 949.6157 0.7358646
## 21 0.1 3 10 50 978.7432 0.7315849
## 31 0.1 4 10 50 990.6231 0.7376646
## 41 0.1 5 10 50 932.9612 0.7480649
## 51 0.1 6 10 50 951.8864 0.7592244
## 61 0.1 7 10 50 921.7977 0.7373406
## 71 0.1 8 10 50 942.2771 0.7483279
## 81 0.1 9 10 50 952.6352 0.7344870
## 91 0.1 10 10 50 981.4375 0.7246907
## MAE RMSESD RsquaredSD MAESD
## 1 550.2855 1074.704 0.2664260 428.4453
## 11 527.5128 1083.015 0.2888108 446.8419
## 21 531.2093 1083.791 0.2669793 423.5485
## 31 566.8528 1057.379 0.2774075 440.9927
## 41 529.8263 1078.462 0.2567087 448.1568
## 51 528.2007 1079.125 0.2408508 402.7447
## 61 508.1109 1096.790 0.2599985 450.5425
## 71 543.5619 1073.001 0.2536127 438.4721
## 81 535.0432 1083.527 0.2696207 428.7920
## 91 555.7202 1065.828 0.2490614 421.6123
pred4 <- predict(rfFit4, NewProducts)
pred4
## [1] 440.042120 494.868771 -4.982187 -10.157833 67.055072 -93.583244
## [7] 1874.913277 224.328088 -81.227054 1778.334768 2341.587789 180.047661
## [13] 553.688351 -38.750876 224.328088 2030.073331 180.973468 180.973468
## [19] -126.554039 -31.038513 48.786293 79.830641 79.830641 2341.587789
pred10 <- predict(rfFit4, testing)
cor(pred10, testing$Volume)^2
## [1] 0.6894207
Not the model we are looking for, negative values..
Linear Regression Model
rfFit5 <- train(Volume~., data = training, method = "lm", tuneLength = 10, trControl = trainControl(method = "cv"))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
#training results
rfFit5
## Linear Regression
##
## 61 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 54, 55, 57, 53, 55, 55, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 964.7516 0.8032241 575.9304
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
rfFit5$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 964.7516 0.8032241 575.9304 866.8826 0.1860898 411.7883
pred5 <- predict(rfFit5, NewProducts)
pred5
## 1 2 3 4 5 6
## -18.427438 -112.067371 171.105871 -111.744637 -335.672332 125.275315
## 7 8 9 10 11 12
## 1477.584915 118.384972 44.360524 929.697747 5214.394008 266.309738
## 13 14 15 16 17 18
## 224.787470 -33.299396 138.772203 4756.698595 92.598201 113.329780
## 19 20 21 22 23 24
## -7.245702 -14.956643 -1762.500312 -490.669438 -724.191266 6520.838798
pred10 <- predict(rfFit5, testing)
cor(pred10, testing$Volume)^2
## [1] 0.7012356
Looks like our LM model is overfitting and also predicting some negative values.
Polynomial Kernel Regularized Least Squares
rfFit6 <- train(Volume~., data = training, method = "krlsPoly", tuneLength = 10,
trControl = trainControl(method = "cv"))
#training results
rfFit6
## Polynomial Kernel Regularized Least Squares
##
## 61 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 55, 55, 54, 56, 56, 54, ...
## Resampling results across tuning parameters:
##
## degree RMSE Rsquared MAE
## 1 10216.387 0.5835669 8962.3431
## 2 1000.329 0.6859279 728.8218
## 3 1027.118 0.6825610 762.2832
##
## Tuning parameter 'lambda' was held constant at a value of NA
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were lambda = NA and degree = 2.
rfFit6$results
## lambda degree RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 NA 1 10216.387 0.5835669 8962.3431 6135.611 0.3065725 5886.5167
## 2 NA 2 1000.329 0.6859279 728.8218 1140.508 0.2588386 467.1964
## 3 NA 3 1027.118 0.6825610 762.2832 1132.043 0.2435466 466.1213
pred6 <- predict(rfFit6, NewProducts)
pred6
## [1] 662.4859 646.9589 649.7113 640.7030 650.0035 652.4429 712.0468 653.7393
## [9] 649.1024 680.1917 768.4608 654.2306 661.6993 642.0568 653.1575 769.5608
## [17] 660.4541 660.7129 647.0344 646.6327 769.2336 673.1936 713.5219 770.3975
pred10 <- predict(rfFit6, testing)
cor(pred10, testing$Volume)^2
## [1] 0.7399693
It looks like our model might be underfitting, still, far away from the results we obtained from our RF model.
Looks like the Random Forest regression Model is our go to model to try and predict the sales volume in this particular scenario.
Lets use those predictions and add them to our dataset.
output <- NewProducts
output$predictions <- pred3
write.csv(output, file="C2.T3output.csv", row.names = TRUE)