Introduction

This document was prepared for the Juvo offline test. The dataset contains the sales rank and the average daily sales over the last 30 days for products that have sold. We use it to build a model that takes the sales rank of a new product as input and predicts its daily sales.

Load the dataset

According to the guideline and the dataset, sales rank is only meaningful and comparable within a single category. The dataset contains four categories, each with the sales rank and daily sales of 100 sold products. Without loss of generality, we build a model only for the “Toys & Games” category. Note that if the new product’s category is unknown or is not one of these four categories, the dataset cannot be used to predict its sales.

We first check the working directory and then put the dataset file into that directory:

getwd()
## [1] "/Users/ccli"

Then load the dataset:

data=read.csv("juvotk_data.csv")

Take a look at the dataset:

head(data)
##         category sales_rank last_30d_sales_avg
## 1 [Toys & Games]      35613             0.7333
## 2 [Toys & Games]       6301             8.4333
## 3 [Toys & Games]       7335             1.0000
## 4 [Toys & Games]      15990             0.1000
## 5 [Toys & Games]        213             3.6667
## 6 [Toys & Games]       8917             1.0000

As explained above, we use only the first category, “Toys & Games”, to build the model, so we subset the dataset:

data=data[2:101,]
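
Selecting rows by position depends on the row order of the file. As a hedged alternative, the sketch below filters on the category column instead. The small data frame `toy` is made up for illustration: only the column names and the "[Toys & Games]" label come from the output above, and "[Other]" is a placeholder label.

```r
# Made-up miniature of the dataset; column names follow head(data) above.
toy <- data.frame(
  category = c("[Toys & Games]", "[Toys & Games]", "[Other]"),  # "[Other]" is a placeholder
  sales_rank = c(35613, 6301, 120),
  last_30d_sales_avg = c(0.7333, 8.4333, 5.0)
)

# Keep only the rows whose category matches, regardless of their position.
toys <- toy[toy$category == "[Toys & Games]", ]
nrow(toys)
```

Applied to the real data, the same filter would not depend on where each category starts or ends in the file.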

Build training and testing data

In the following, we use the “caret” package to build the models, so please install and load it:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2

We randomly split the dataset into two roughly equal parts with createDataPartition: one part is the training set used to build the model, and the other is the test set used for hold-out validation.

set.seed(333)
inTrain<-createDataPartition(y=data$sales_rank,p=0.5,list=FALSE)
trainData<-data[inTrain,]
testData<-data[-inTrain,]
head(trainData)
##          category sales_rank last_30d_sales_avg
## 2  [Toys & Games]       6301             8.4333
## 3  [Toys & Games]       7335             1.0000
## 5  [Toys & Games]        213             3.6667
## 8  [Toys & Games]      21965             0.4333
## 9  [Toys & Games]       9323             2.3000
## 11 [Toys & Games]       7432             2.5667

Then take a look at the distribution within the training set:

Use Linear Model to predict first

A linear model of daily sales on log10(sales rank) is unlikely to fit well, but it is a reasonable baseline to try first.

lm1<-lm(last_30d_sales_avg~log10(sales_rank),data=trainData)
summary(lm1)
## 
## Call:
## lm(formula = last_30d_sales_avg ~ log10(sales_rank), data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5516 -0.8632 -0.2702  0.2783  9.1324 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        13.3828     1.9911   6.721 1.64e-08 ***
## log10(sales_rank)  -2.6475     0.4573  -5.789 4.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.877 on 50 degrees of freedom
## Multiple R-squared:  0.4013, Adjusted R-squared:  0.3893 
## F-statistic: 33.51 on 1 and 50 DF,  p-value: 4.649e-07

We add the fitted line to the previous graph and check how well it fits:

plot(log10(trainData$sales_rank),trainData$last_30d_sales_avg,pch=19,col="blue",xlab="Sales Rank in log_10", ylab = "Avg Sales in last 30Days")
lines(log10(trainData$sales_rank),lm1$fitted,lwd=3)

Given a new product’s sales rank, we can then predict its daily sales. Suppose the rank is 1,000.

newdata<-data.frame(sales_rank=1000)
predict(lm1,newdata)
##        1 
## 5.440202
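
The prediction can be checked by hand from the fitted coefficients. The sketch below uses the coefficients as rounded in the summary output above, so it should agree with predict() only to about four significant digits.

```r
# Coefficients as rounded in summary(lm1) above.
intercept <- 13.3828
slope <- -2.6475

# Prediction for a sales rank of 1,000, where log10(1000) = 3.
pred <- intercept + slope * log10(1000)
pred
```

The result, 13.3828 - 2.6475 * 3 = 5.4403, is consistent with the 5.440202 returned by predict().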

Then we evaluate the performance of the linear model. First we plot the fitted line against both the training set and the test set to compare the fit.

par(mfrow=c(1,2))
plot(log10(trainData$sales_rank),trainData$last_30d_sales_avg,pch=19,main="Training Set",col="blue",xlab="Sales Rank in log_10", ylab = "Avg Sales in last 30Days")
lines(log10(trainData$sales_rank),predict(lm1),lwd=3)
plot(log10(testData$sales_rank),testData$last_30d_sales_avg,pch=19,main="Testing Set",col="blue",xlab="Sales Rank in log_10", ylab = "Avg Sales in last 30Days")
lines(log10(testData$sales_rank),predict(lm1,newdata=testData),lwd=3)

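Further, we calculate the Root Mean Squared Error (RMSE) on each set to obtain the training and testing errors. Note that the expressions below take the square root of the sum, not the mean, of the squared errors, so each reported value equals the RMSE multiplied by sqrt(n), where n is the number of rows in the set (52 for training, 48 for testing). Dividing by sqrt(n) gives an RMSE of roughly 1.84 on training and 1.73 on testing.

Error on the training set: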

sqrt(sum((lm1$fitted-trainData$last_30d_sales_avg)^2))
## [1] 13.27231

Error on the test set:

sqrt(sum((predict(lm1,newdata=testData)-testData$last_30d_sales_avg)^2))
## [1] 11.97441
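
A minimal sketch of an RMSE helper that averages the squared errors before taking the root, followed by the conversion of the two values reported above, which were computed from the sum of squared errors. The helper name `rmse` and the three-element vectors are illustrative only; the row counts 52 and 48 follow from the 100-row subset and the 52 training residuals.

```r
# RMSE: square root of the mean squared difference between predictions and observations.
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

# Tiny made-up example: the errors are 0, 0 and 2, so RMSE = sqrt(4 / 3).
rmse(c(1, 2, 5), c(1, 2, 3))

# Convert the root-sum-of-squares values reported above into RMSE.
train_rmse <- 13.27231 / sqrt(52)  # 52 training rows
test_rmse <- 11.97441 / sqrt(48)   # 48 test rows
```

Because the two sets differ in size, only the per-row RMSE values (about 1.84 and 1.73) are directly comparable between training and testing.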

RMSE will be our main metric for evaluating the models we build.

Use caret package to train more models

We can use the caret package to fit the same linear model:

lm2<-train(last_30d_sales_avg~log10(sales_rank),data=trainData,method="lm")
summary(lm2$finalModel)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5516 -0.8632 -0.2702  0.2783  9.1324 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          13.3828     1.9911   6.721 1.64e-08 ***
## `log10(sales_rank)`  -2.6475     0.4573  -5.789 4.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.877 on 50 degrees of freedom
## Multiple R-squared:  0.4013, Adjusted R-squared:  0.3893 
## F-statistic: 33.51 on 1 and 50 DF,  p-value: 4.649e-07

Besides the linear model, we can fit other models to the training set. A complete list of the models provided by the caret package is available in the caret documentation.

There are plenty of models for regression. For example, we can build the model with Non-Negative Least Squares, k-Nearest Neighbors, and Bagged CART.

nnls<-train(last_30d_sales_avg~log10(sales_rank),data=trainData,method="nnls")
## Loading required package: nnls
summary(nnls$finalModel)
##             Length Class      Mode     
## x            1     -none-     numeric  
## deviance     1     -none-     numeric  
## residuals   52     -none-     numeric  
## fitted      52     -none-     numeric  
## mode         1     -none-     numeric  
## passive      1     -none-     numeric  
## bound        0     -none-     logical  
## nsetp        1     -none-     numeric  
## xNames       1     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    1     -none-     logical
knn<-train(last_30d_sales_avg~log10(sales_rank),data=trainData,method="knn")
summary(knn$finalModel)
##             Length Class      Mode     
## learn       2      -none-     list     
## k           1      -none-     numeric  
## theDots     0      -none-     list     
## xNames      1      -none-     character
## problemType 1      -none-     character
## tuneValue   1      data.frame list     
## obsLevels   1      -none-     logical
treebag<-train(last_30d_sales_avg~log10(sales_rank),data=trainData,method="treebag")
## Loading required package: ipred
## Loading required package: plyr
## Loading required package: e1071
summary(treebag$finalModel)
##             Length Class      Mode     
## y           52     -none-     numeric  
## X            0     -none-     NULL     
## mtrees      25     -none-     list     
## OOB          1     -none-     logical  
## comb         1     -none-     logical  
## xNames       1     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    1     -none-     logical

Errors on the test set for each model, computed in the same way as for the linear model:

sqrt(sum((predict(nnls,newdata=testData)-testData$last_30d_sales_avg)^2))
## [1] 15.12192
sqrt(sum((predict(knn,newdata=testData)-testData$last_30d_sales_avg)^2))
## [1] 12.72697
sqrt(sum((predict(treebag,newdata=testData)-testData$last_30d_sales_avg)^2))
## [1] 13.69042
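
To compare the models side by side, the test-set errors can be collected in a named vector and sorted. The sketch below copies the four values from the outputs above, including the linear model's 11.97441; the variable names are illustrative.

```r
# Test-set errors copied from the outputs above (all computed on the same 48 test rows).
test_error <- c(lm = 11.97441, nnls = 15.12192, knn = 12.72697, treebag = 13.69042)

# Sort from lowest to highest error and pick the best model.
sort(test_error)
best <- names(which.min(test_error))
best
```

Since every value comes from the same test set, the ordering is the same whether the errors are summed or averaged.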

Conclusion

Of the algorithms we have tried, the linear model is the simplest and has the lowest test error (11.97, compared with 12.73 for k-Nearest Neighbors, 13.69 for Bagged CART, and 15.12 for Non-Negative Least Squares). The main reason is probably that the training set has only one feature, so there is not enough information for more complex models to exploit. For further research, we may need more features of the sold products, for example price and weight, and could then try more regression models to predict daily sales. If overfitting occurs, one possible remedy is a regularized model, tuned by adjusting the regularization parameter lambda.
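
As an illustration of how lambda controls regularization, here is a minimal ridge regression sketch in base R. It is only one possible regularized model, and it runs on synthetic data: the `price` feature, the coefficients used to generate `y`, and the lambda values are all assumptions made for the example, not results from the Juvo dataset.

```r
set.seed(1)

# Synthetic data with two hypothetical features; not the Juvo dataset.
n <- 50
X <- cbind(log_rank = runif(n, 2, 5), price = runif(n, 5, 50))
y <- 13 - 2.5 * X[, "log_rank"] + 0.01 * X[, "price"] + rnorm(n)

# Standardize the features and center the response, so the penalty treats
# both features equally and the intercept is left unpenalized.
Xs <- scale(X)
yc <- y - mean(y)

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y.
ridge <- function(Xs, yc, lambda) {
  solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, yc))
}

beta_ols <- ridge(Xs, yc, lambda = 0)     # lambda = 0 is ordinary least squares
beta_ridge <- ridge(Xs, yc, lambda = 10)  # lambda > 0 shrinks the coefficients

# A larger lambda pulls the coefficient vector toward zero.
sum(beta_ridge^2) < sum(beta_ols^2)
```

In practice lambda would be chosen by comparing validation error across several candidate values rather than fixed in advance.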