This document is for the Juvo offline test. We use a dataset containing the sales rank and the average daily sales over the last 30 days for sold products to build a model that takes the sales rank of a new product and predicts its daily sales.
According to the guideline and the dataset, sales rank is only meaningful and comparable within a single category. The dataset contains four categories, each with the sales rank and daily sales of 100 sold products. Without loss of generality, we build a model only for the “Toys & Games” category. Note that if the new product’s category is unknown or is not one of these four categories, the dataset cannot be used to predict its sales.
We first check the working directory and then put the dataset file into that directory:
getwd()
## [1] "/Users/ccli"
Then load the dataset:
data=read.csv("juvotk_data.csv")
Take a look at the dataset:
head(data)
## category sales_rank last_30d_sales_avg
## 1 [Toys & Games] 35613 0.7333
## 2 [Toys & Games] 6301 8.4333
## 3 [Toys & Games] 7335 1.0000
## 4 [Toys & Games] 15990 0.1000
## 5 [Toys & Games] 213 3.6667
## 6 [Toys & Games] 8917 1.0000
As noted above, we use only the first category, “Toys & Games”, to build the model, so we subset the dataset:
data=data[2:101,]
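Selecting rows by position depends on the row order of the file. A minimal sketch of selecting by category label instead, assuming the category column holds literal strings such as "[Toys & Games]" as printed by head(data) above. The toy data frame, its values beyond the first two rows, and the "[Other]" label are made up for illustration:

```r
# Toy stand-in for the real data; the "[Other]" rows are hypothetical.
toy <- data.frame(
  category = c("[Toys & Games]", "[Toys & Games]", "[Other]", "[Other]"),
  sales_rank = c(35613, 6301, 120, 4500),
  last_30d_sales_avg = c(0.7333, 8.4333, 12.0, 2.5),
  stringsAsFactors = FALSE
)

# Keep only the rows whose label matches, wherever they sit in the file.
toys <- toy[toy$category == "[Toys & Games]", ]
nrow(toys)  # 2
```

On the real data the same expression would read data[data$category == "[Toys & Games]", ]. Note that the positional subset 2:101 keeps rows 2 through 101, so the two selections need not coincide.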
In the following, we use the “caret” package to build the models, so install it if necessary and load it:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
We randomly split the dataset into two roughly equal parts: a training set used to build the model and a test set held out for validation.
set.seed(333)
inTrain<-createDataPartition(y=data$sales_rank,p=0.5,list=FALSE)
trainData<-data[inTrain,]
testData<-data[-inTrain,]
head(trainData)
## category sales_rank last_30d_sales_avg
## 2 [Toys & Games] 6301 8.4333
## 3 [Toys & Games] 7335 1.0000
## 5 [Toys & Games] 213 3.6667
## 8 [Toys & Games] 21965 0.4333
## 9 [Toys & Games] 9323 2.3000
## 11 [Toys & Games] 7432 2.5667
Then take a look at the distribution within the training set, plotting average daily sales against sales rank on a log10 scale. The scatter suggests that a linear model may not fit well, but it is a reasonable first attempt.
lm1<-lm(last_30d_sales_avg~log10(sales_rank),data=trainData)
summary(lm1)
##
## Call:
## lm(formula = last_30d_sales_avg ~ log10(sales_rank), data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5516 -0.8632 -0.2702 0.2783 9.1324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.3828 1.9911 6.721 1.64e-08 ***
## log10(sales_rank) -2.6475 0.4573 -5.789 4.65e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.877 on 50 degrees of freedom
## Multiple R-squared: 0.4013, Adjusted R-squared: 0.3893
## F-statistic: 33.51 on 1 and 50 DF, p-value: 4.649e-07
We add the fitted line to the previous graph and take a look at the fit:
plot(log10(trainData$sales_rank),trainData$last_30d_sales_avg,pch=19,col="blue",xlab="Sales Rank in log_10", ylab = "Avg Sales in last 30Days")
lines(log10(trainData$sales_rank),lm1$fitted,lwd=3)
Given a new product’s sales rank, we can then predict its daily sales. Suppose the rank is 1,000.
newdata<-data.frame(sales_rank=1000)
predict(lm1,newdata)
## 1
## 5.440202
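The prediction can be checked by hand from the coefficients reported in summary(lm1). A small sketch, using the rounded estimates printed above, so the result agrees with predict() only to about three decimal places:

```r
# Rounded coefficients copied from the summary(lm1) output above.
intercept <- 13.3828
slope <- -2.6475

# The model is linear in log10(sales_rank), and log10(1000) is 3.
pred <- intercept + slope * log10(1000)
round(pred, 2)  # 5.44
```

Because the slope is negative, each tenfold increase in sales rank lowers the predicted daily sales by about 2.65 units.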
Then we evaluate the performance of the linear model. First we plot the fit on the training set and on the test set side by side to show the difference.
par(mfrow=c(1,2))
plot(log10(trainData$sales_rank),trainData$last_30d_sales_avg,pch=19,main="Training Set",col="blue",xlab="Sales Rank in log_10", ylab = "Avg Sales in last 30Days")
lines(log10(trainData$sales_rank),predict(lm1),lwd=3)
plot(log10(testData$sales_rank),testData$last_30d_sales_avg,pch=19,main="Testing Set",col="blue",xlab="Sales Rank in log_10", ylab = "Avg Sales in last 30Days")
lines(log10(testData$sales_rank),predict(lm1,newdata=testData),lwd=3)
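Further, we calculate an error measure on each of the two sets to obtain training and testing errors. Note that the expressions below compute the square root of the sum of squared errors, which equals the Root Mean Squared Error (RMSE) multiplied by the square root of the number of observations; the RMSE proper would use mean() in place of sum().
Error on the training set: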
sqrt(sum((lm1$fitted-trainData$last_30d_sales_avg)^2))
## [1] 13.27231
Error on the test set:
sqrt(sum((predict(lm1,newdata=testData)-testData$last_30d_sales_avg)^2))
## [1] 11.97441
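This error measure will be our main criterion for evaluating the models we build. Because every model is scored on the same test set, ranking models by the root of the summed squared errors is equivalent to ranking them by RMSE.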
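For reference, RMSE takes the mean of the squared errors before the square root, whereas the expressions above take the sum. A sketch of both, with hypothetical helper names rmse and root_sse and made-up vectors. The row counts 52 and 48 are inferred rather than stated in the text: 52 from the 50 residual degrees of freedom plus two coefficients in summary(lm1), and 48 from the 100 rows kept by data[2:101,].

```r
# Root mean squared error: average the squared errors, then take the root.
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
# Root of the summed squared errors, as computed in the text above.
root_sse <- function(pred, obs) sqrt(sum((pred - obs)^2))

pred <- c(1, 2, 3, 4)
obs  <- c(1, 2, 3, 8)    # a single error of size 4
rmse(pred, obs)          # sqrt(16 / 4) = 2
root_sse(pred, obs)      # sqrt(16) = 4

# Rescaling the reported values to a per-observation RMSE
# (assumes 52 training rows and 48 test rows, as inferred above).
train_rmse <- 13.27231 / sqrt(52)
test_rmse  <- 11.97441 / sqrt(48)
round(c(train = train_rmse, test = test_rmse), 2)  # roughly 1.84 and 1.73
```

The per-observation values are easier to compare across sets of different sizes, and the training value is close to the residual standard error of 1.877 reported by summary(lm1).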
We can use the caret package to fit the same model:
lm2<-train(last_30d_sales_avg~log10(sales_rank),data=trainData,method="lm")
summary(lm2$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5516 -0.8632 -0.2702 0.2783 9.1324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.3828 1.9911 6.721 1.64e-08 ***
## `log10(sales_rank)` -2.6475 0.4573 -5.789 4.65e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.877 on 50 degrees of freedom
## Multiple R-squared: 0.4013, Adjusted R-squared: 0.3893
## F-statistic: 33.51 on 1 and 50 DF, p-value: 4.649e-07
Besides the linear model, we can build other models on the training set. A complete list of the models provided by the caret package is available in the caret documentation.
There are plenty of models for regression. For example, we can build the model with Non-Negative Least Squares, k-Nearest Neighbors, and Bagged CART.
nnls<-train(last_30d_sales_avg~log10(sales_rank),data=trainData,method="nnls")
## Loading required package: nnls
summary(nnls$finalModel)
## Length Class Mode
## x 1 -none- numeric
## deviance 1 -none- numeric
## residuals 52 -none- numeric
## fitted 52 -none- numeric
## mode 1 -none- numeric
## passive 1 -none- numeric
## bound 0 -none- logical
## nsetp 1 -none- numeric
## xNames 1 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
knn<-train(last_30d_sales_avg~log10(sales_rank),data=trainData,method="knn")
summary(knn$finalModel)
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 1 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
treebag<-train(last_30d_sales_avg~log10(sales_rank),data=trainData,method="treebag")
## Loading required package: ipred
## Loading required package: plyr
## Loading required package: e1071
summary(treebag$finalModel)
## Length Class Mode
## y 52 -none- numeric
## X 0 -none- NULL
## mtrees 25 -none- list
## OOB 1 -none- logical
## comb 1 -none- logical
## xNames 1 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
The corresponding errors on the test set:
sqrt(sum((predict(nnls,newdata=testData)-testData$last_30d_sales_avg)^2))
## [1] 15.12192
sqrt(sum((predict(knn,newdata=testData)-testData$last_30d_sales_avg)^2))
## [1] 12.72697
sqrt(sum((predict(treebag,newdata=testData)-testData$last_30d_sales_avg)^2))
## [1] 13.69042
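Collecting the four test-set values reported above in one vector makes the comparison easier to read. The numbers are copied from the outputs in this document; the vector name test_err is introduced here only for illustration:

```r
# Test-set errors copied from the outputs above (root of summed squared errors).
test_err <- c(lm = 11.97441, nnls = 15.12192, knn = 12.72697, treebag = 13.69042)

sort(test_err)              # ascending order: lm, knn, treebag, nnls
names(which.min(test_err))  # "lm"
```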
Among the algorithms we have tried, the linear model is both the simplest and the most accurate on the test set (11.97, compared with 12.73 for k-Nearest Neighbors, 13.69 for Bagged CART, and 15.12 for Non-Negative Least Squares). The main reason is that the training set contains only one feature, so there is not enough information for more complicated models to exploit. For further research, we would need more features of the sold products, for example price and weight, and could then try more regression models to predict daily sales. If overfitting occurs, one possible remedy is a regularized model, tuning the regularization strength lambda.
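As an illustration of what adjusting lambda does, here is a minimal ridge-regression sketch in base R using the closed-form solution. It is not part of the analysis above: the data are simulated, the helper ridge_coef is hypothetical, and the second feature price merely stands in for the extra features suggested above.

```r
set.seed(1)
n <- 50

# Simulated predictors and response (made up for this sketch).
log_rank <- runif(n, 2, 5)
price    <- runif(n, 5, 50)
y <- 13 - 2.6 * log_rank + 0.01 * price + rnorm(n)

# Standardise the predictors and centre y so no intercept needs penalising.
X  <- scale(cbind(log_rank, price))
yc <- y - mean(y)

# Closed-form ridge estimate: solve (X'X + lambda * I) b = X'y.
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

b0   <- ridge_coef(X, yc, 0)    # lambda = 0 reproduces least squares
b10  <- ridge_coef(X, yc, 10)
b100 <- ridge_coef(X, yc, 100)
rbind(b0, b10, b100)            # coefficients shrink as lambda grows
```

Larger values of lambda pull the coefficients towards zero, trading a little bias for lower variance; in practice lambda would be chosen by cross-validation rather than fixed by hand.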