This paper presents a solution to the prediction task that Rossmann published as a competition on Kaggle in September 2015. The task was to forecast sales for 1115 Rossmann stores in August and September 2015 based on historical data from January 2013 to July 2015.
The task can be accessed at the following URL: Rossmann Store Sales
The data to be used to solve the task were published in three CSV files on Kaggle’s site. The files are the following:
train.csv: Contains the training data on which the predictions must be based. The variables included in this file are as follows:
t <- read.csv('F:/prog/RStudio/rossmann/train.csv',colClasses=c("factor","factor","Date","integer","integer","factor","factor","factor","factor"))
t[,6] <- t[,6] == 1
t[,7] <- t[,7] == 1
t[,9] <- t[,9] == 1
summary(t)
## Store DayOfWeek Date Sales
## 1 : 942 1:144730 Min. :2013-01-01 Min. : 0
## 10 : 942 2:145664 1st Qu.:2013-08-17 1st Qu.: 3727
## 1001 : 942 3:145665 Median :2014-04-02 Median : 5744
## 1002 : 942 4:145845 Mean :2014-04-11 Mean : 5774
## 1003 : 942 5:145845 3rd Qu.:2014-12-12 3rd Qu.: 7856
## 1005 : 942 6:144730 Max. :2015-07-31 Max. :41551
## (Other):1011557 7:144730
## Customers Open Promo StateHoliday
## Min. : 0.0 Mode :logical Mode :logical 0:986159
## 1st Qu.: 405.0 FALSE:172817 FALSE:629129 a: 20260
## Median : 609.0 TRUE :844392 TRUE :388080 b: 6690
## Mean : 633.1 NA's :0 NA's :0 c: 4100
## 3rd Qu.: 837.0
## Max. :7388.0
##
## SchoolHoliday
## Mode :logical
## FALSE:835488
## TRUE :181721
## NA's :0
##
##
##
The variables are explained briefly on the competition site, so I will not repeat the descriptions here. It is worth noting that the data contain a separate DayOfWeek variable, which is redundant since this information can easily be extracted from the Date variable. In the code above I converted the 1 and 0 values provided for the boolean variables in the respective columns of the data file into logical TRUE and FALSE values. Although this is not immediately apparent from the documentation, inspecting the data shows that the SchoolHoliday variable is TRUE for the days of the Easter, summer, autumn and Christmas school holidays but FALSE for days like Sundays or state holidays, when schools are also closed.
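As a quick sanity check (a sketch, not part of the original preprocessing; it assumes that DayOfWeek follows the convention Monday = 1 through Sunday = 7, which the data appear to use), the redundancy of DayOfWeek can be verified directly:
wd <- as.POSIXlt(t$Date)$wday # weekday as 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wd[wd == 0] <- 7 # shift Sunday to 7 so that Monday = 1 ... Sunday = 7
all(as.numeric(as.character(t$DayOfWeek)) == wd) # TRUE if DayOfWeek matches the weekday derived from Date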
store.csv: Provides information on the Rossmann stores, especially whether the store is participating in a continuing promotion programme and the distance of the nearest competition store. The type and assortment size of the store are also characterized.
s <- read.csv('F:/prog/RStudio/rossmann/store.csv',colClasses=c("factor","factor","factor","integer","integer","integer","factor","integer","integer","character"))
s[,7] <- s[,7] == 1
summary(s)
## Store StoreType Assortment CompetitionDistance
## 1 : 1 a:602 a:593 Min. : 20.0
## 10 : 1 b: 17 b: 9 1st Qu.: 717.5
## 100 : 1 c:148 c:513 Median : 2325.0
## 1000 : 1 d:348 Mean : 5404.9
## 1001 : 1 3rd Qu.: 6882.5
## 1002 : 1 Max. :75860.0
## (Other):1109 NA's :3
## CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2
## Min. : 1.000 Min. :1900 Mode :logical
## 1st Qu.: 4.000 1st Qu.:2006 FALSE:544
## Median : 8.000 Median :2010 TRUE :571
## Mean : 7.225 Mean :2009 NA's :0
## 3rd Qu.:10.000 3rd Qu.:2013
## Max. :12.000 Max. :2015
## NA's :354 NA's :354
## Promo2SinceWeek Promo2SinceYear PromoInterval
## Min. : 1.0 Min. :2009 Length:1115
## 1st Qu.:13.0 1st Qu.:2011 Class :character
## Median :22.0 Median :2012 Mode :character
## Mean :23.6 Mean :2012
## 3rd Qu.:37.0 3rd Qu.:2013
## Max. :50.0 Max. :2015
## NA's :544 NA's :544
s[2,]
## Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth
## 2 2 a a 570 11
## CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear
## 2 2007 TRUE 13 2010
## PromoInterval
## 2 Jan,Apr,Jul,Oct
As the example shows, the PromoInterval variable is presented in the table as a comma-separated string of month name abbreviations. It is not immediately clear to what extent the information in this file is useful for calculating our predictions. Initially I will ignore this table and work with the training file only.
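Should this information be needed later, the string could be split into the individual months, for example (a sketch only, since the table is ignored for now):
strsplit(s$PromoInterval[2], ",")[[1]] # yields "Jan" "Apr" "Jul" "Oct" for the store shown above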
test.csv: This table contains the same columns as train.csv, but the sales and the number of customers are unknown. The task is to predict sales only. Note that the “test” table cannot be used for testing purposes directly since it does not include any relevant data. It is used only for generating the predictions, which are then submitted to Kaggle’s evaluation system, which in turn assesses them using a root mean squared percentage error (RMSPE) metric (a small helper implementing this metric is sketched after this overview).
ttest <- read.csv('F:/prog/RStudio/rossmann/test.csv',colClasses=c("numeric","factor","factor","Date","integer","integer","factor","factor"))
ttest[,5] <- ttest[,5] == 1
ttest[,6] <- ttest[,6] == 1
ttest[,8] <- ttest[,8] == 1
summary(ttest)
## Id Store DayOfWeek Date
## Min. : 1 1 : 48 1:5992 Min. :2015-08-01
## 1st Qu.:10273 10 : 48 2:5992 1st Qu.:2015-08-12
## Median :20545 100 : 48 3:5992 Median :2015-08-24
## Mean :20545 1000 : 48 4:5992 Mean :2015-08-24
## 3rd Qu.:30816 1003 : 48 5:5136 3rd Qu.:2015-09-05
## Max. :41088 1004 : 48 6:5992 Max. :2015-09-17
## (Other):40800 7:5992
## Open Promo StateHoliday SchoolHoliday
## Mode :logical Mode :logical 0:40908 Mode :logical
## FALSE:5984 FALSE:24824 a: 180 FALSE:22866
## TRUE :35093 TRUE :16264 TRUE :18222
## NA's :11 NA's :0 NA's :0
##
##
##
head(ttest)
## Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
## 1 1 1 4 2015-09-17 TRUE TRUE 0 FALSE
## 2 2 3 4 2015-09-17 TRUE TRUE 0 FALSE
## 3 3 7 4 2015-09-17 TRUE TRUE 0 FALSE
## 4 4 8 4 2015-09-17 TRUE TRUE 0 FALSE
## 5 5 9 4 2015-09-17 TRUE TRUE 0 FALSE
## 6 6 10 4 2015-09-17 TRUE TRUE 0 FALSE
ttest <- ttest[2:8]
The first column is new: it is simply a row ID, which will not be needed, since R automatically assigns row names to data frames, and these are retained when the rows of the table are reordered.
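For reference, the RMSPE metric mentioned above can be written as a small helper function for local experimentation (a sketch; according to the competition description, days with zero sales are ignored in the scoring):
rmspe <- function(actual, predicted) {
keep <- actual != 0 # days with zero sales are excluded from scoring
sqrt(mean(((actual[keep] - predicted[keep]) / actual[keep])^2))
}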
First of all, it should be noted that the Customers variable in the data set serves no purpose. On the one hand, it is not available as an input variable in the test set (which is unfortunate, since the number of customers is presumably the best predictor of a store's sales), so fitting a model to it would be useless from the perspective of the task. On the other hand, the task is not to predict the number of customers, so fitting a model to predict them from the variables given in the test set makes no sense either. Therefore this variable can be dropped from the training set.
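Before the column is dropped, one could quickly check how strongly the number of customers is associated with sales (shown for illustration only; it plays no role in the models below):
cor(t$Customers, t$Sales) # correlation between daily customers and daily sales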
As can be expected, sales of closed shops are zero:
summary(t$Sales[!t$Open])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
Therefore, predicting sales for closed shops is trivial, so I remove all data for closed shops from the data set, which reduces its size by about 17 per cent. The Open column also becomes superfluous.
length(t$Sales[!t$Open]) / nrow(t)
## [1] 0.1698933
t <- t[t$Open,]
Since the usual machine learning algorithms cannot handle dates as an input variable directly, I extract a number of numeric variables from this attribute: year, month, day and week number. The year, month and day variables are obvious. Week number might be useful, as it is coarser than the day but finer than the month variable, and will be necessary if I want to take into consideration the Promo2 type of promotion later (since the starting week of the store’s participation is given in week number).
Although all of these are strictly speaking factor variables, I will store them as numeric values for now and adjust their data type later.
t$DateYear <- as.numeric(strftime(t$Date, format="%y")) # the strftime function extracts a string from the date, so this must be transformed to a numeric value
t$DateMonth <- as.numeric(strftime(t$Date, format="%m"))
t$DateDay <- as.numeric(strftime(t$Date, format="%d"))
t$DateWeek <- as.numeric(strftime(t$Date, format="%W"))
t <- t[c(1,4,7:13,2)] # After having extracted the components of the date, I am dropping the "Customers" and "Date" variables and moving the "DayOfWeek" variable into the last column since this way it is grouped together with the other date components
head(t)
## Store Sales Promo StateHoliday SchoolHoliday DateYear DateMonth DateDay
## 1 1 5263 TRUE 0 TRUE 15 7 31
## 2 2 6064 TRUE 0 TRUE 15 7 31
## 3 3 8314 TRUE 0 TRUE 15 7 31
## 4 4 13995 TRUE 0 TRUE 15 7 31
## 5 5 4822 TRUE 0 TRUE 15 7 31
## 6 6 5651 TRUE 0 TRUE 15 7 31
## DateWeek DayOfWeek
## 1 30 5
## 2 30 5
## 3 30 5
## 4 30 5
## 5 30 5
## 6 30 5
As we saw at the end of the first section, not all shops appear in the test data. The test set covers 48 days for each store but contains only 41088 rows in total, which means that we only need to generate models for 856 stores instead of the total of 1115.
41088 / 48
## [1] 856
Since I will be fitting models for each shop separately in order to predict sales in the test period, data for shops that are absent from the test set serve no purpose. They will not be used to fit models anyway. Therefore, I collect the store IDs from the test set in a separate variable.
teststores <- as.numeric(as.character(unique(ttest$Store)))
teststores[1:10]
## [1] 1 3 7 8 9 10 11 12 13 14
This variable is used to filter out the data for the unused stores.
nrow(t)
## [1] 844392
t <- t[t$Store %in% teststores,]
nrow(t)
## [1] 641985
This reduces the data to be processed by almost 25 per cent.
It is also apparent from the summary of the test data that there are some NA values in the Open variable, which must be filled in for the predictions to be generated properly:
ttest[is.na(ttest$Open),]
## Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
## 480 622 4 2015-09-17 NA TRUE 0 FALSE
## 1336 622 3 2015-09-16 NA TRUE 0 FALSE
## 2192 622 2 2015-09-15 NA TRUE 0 FALSE
## 3048 622 1 2015-09-14 NA TRUE 0 FALSE
## 4760 622 6 2015-09-12 NA FALSE 0 FALSE
## 5616 622 5 2015-09-11 NA FALSE 0 FALSE
## 6472 622 4 2015-09-10 NA FALSE 0 FALSE
## 7328 622 3 2015-09-09 NA FALSE 0 FALSE
## 8184 622 2 2015-09-08 NA FALSE 0 FALSE
## 9040 622 1 2015-09-07 NA FALSE 0 FALSE
## 10752 622 6 2015-09-05 NA FALSE 0 FALSE
We see that all NA values pertain to a single store (622), they cover consecutive days except the Sundays in this period, and they include all other days of the week (Monday to Saturday). Furthermore, there are four days at the end of the period when an active promotion is running in this store. Since the intervening Sundays are explicitly marked as closed (Open is FALSE), we can assume that the NA days are all open days. Therefore I set the NA values to TRUE.
ttest[ttest$Store == 622 & ttest$Date > "2015-09-05" & ttest$DayOfWeek == 7,]
## Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
## 3904 622 7 2015-09-13 FALSE FALSE 0 FALSE
## 9896 622 7 2015-09-06 FALSE FALSE 0 FALSE
ttest[is.na(ttest$Open),]$Open <- T
I examine the data according to the usual descriptive statistics.
summary(t$Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 4848 6299 6855 8198 41550
sd(t$Sales)
## [1] 2953.504
hist(t$Sales,xlab="Sales Amount")
boxplot(t$Sales)
Sales above 20000 appear to be outliers in the data set, so these should be checked in order to ascertain that the data are not corrupted.
summary(t[t$Sales > 20000,])
## Store Sales Promo StateHoliday
## 1114 :429 Min. :20002 Mode :logical 0:2464
## 262 :407 1st Qu.:20747 FALSE:713 a: 23
## 251 :263 Median :21868 TRUE :1788 b: 10
## 562 :221 Mean :22902 NA's :0 c: 4
## 842 :188 3rd Qu.:23832
## 383 :123 Max. :41551
## (Other):870
## SchoolHoliday DateYear DateMonth DateDay
## Mode :logical Min. :13.00 Min. : 1.000 Min. : 1.00
## FALSE:1971 1st Qu.:13.00 1st Qu.: 4.000 1st Qu.: 5.00
## TRUE :530 Median :14.00 Median : 6.000 Median :16.00
## NA's :0 Mean :13.84 Mean : 6.904 Mean :15.05
## 3rd Qu.:14.00 3rd Qu.:11.000 3rd Qu.:23.00
## Max. :15.00 Max. :12.000 Max. :31.00
##
## DateWeek DayOfWeek
## Min. : 0.0 1:926
## 1st Qu.:15.0 2:367
## Median :26.0 3:231
## Mean :27.5 4:216
## 3rd Qu.:44.0 5:336
## Max. :52.0 6:255
## 7:170
It appears that certain stores (1114, 262, etc.) simply produce very high sales consistently, which means that these sales figures are very likely correct rather than data errors. It is also worth noting that the exceedingly high sales are spread across all kinds of dates (and over all days of the week, Monday being the best day for sales by far), school holiday or not, promo or not, etc. We can conclude that the data are probably dependable and of high quality.
One aspect that must be checked is whether sufficient historical data are available for all stores. For this reason I count the number of rows for each store.
openDays <- aggregate(t$Store,list(t$Store),length)
head(openDays)
## Group.1 x
## 1 1 781
## 2 10 784
## 3 100 606
## 4 1000 622
## 5 1003 779
## 6 1004 622
We have historical data on about 781 days for store 1, etc. Statistics for all stores show that we have a rather balanced number of open days, with some outliers but very little variation otherwise. Even the store with the smallest number of days (592) was open for more than a year and a half, so there should be enough data to base our predictions on.
summary(openDays)
## Group.1 x
## 1 : 1 Min. :592.0
## 10 : 1 1st Qu.:770.5
## 100 : 1 Median :779.0
## 1000 : 1 Mean :750.0
## 1003 : 1 3rd Qu.:781.0
## 1004 : 1 Max. :942.0
## (Other):850
Next, I investigate how sales are related to the input variables throughout the data set:
t.test(t[t$Promo,]$Sales,t[!t$Promo,]$Sales)
##
## Welch Two Sample t-test
##
## data: t[t$Promo, ]$Sales and t[!t$Promo, ]$Sales
## t = 306.39, df = 553080, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2146.504 2174.143
## sample estimates:
## mean of x mean of y
## 8052.484 5892.160
Not surprisingly, promotions have a very significant effect on sales, which are about 37 per cent higher on average during promotions.
t.test(t[t$StateHoliday != 0,]$Sales,t[t$StateHoliday == 0,]$Sales)
##
## Welch Two Sample t-test
##
## data: t[t$StateHoliday != 0, ]$Sales and t[t$StateHoliday == 0, ]$Sales
## t = 7.9118, df = 553.16, p-value = 1.39e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1844.204 3062.364
## sample estimates:
## mean of x mean of y
## 9305.717 6852.433
t.test(t[t$SchoolHoliday,]$Sales,t[!t$SchoolHoliday,]$Sales)
##
## Welch Two Sample t-test
##
## data: t[t$SchoolHoliday, ]$Sales and t[!t$SchoolHoliday, ]$Sales
## t = 27.179, df = 179800, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 242.7369 280.4673
## sample estimates:
## mean of x mean of y
## 7066.096 6804.493
State holidays of all types have a very large effect on sales as well, whereas the impact of school holidays is relatively negligible, although statistically still highly significant.
tapply(t$Sales,t$DateYear,mean)
## 13 14 15
## 6695.345 6930.516 7012.287
Sales are growing each year, which is probably due to the fact that Rossmann’s prices are being raised on average little by little (even though there is essentially no inflation in Germany).
tapply(t$Sales,t$DateMonth,mean)
## 1 2 3 4 5 6 7 8
## 6460.397 6494.003 6866.441 6938.716 7023.093 6926.269 6874.591 6586.886
## 9 10 11 12
## 6436.536 6476.993 7076.840 8447.998
plot(tapply(t$Sales,t$DateMonth,mean),xlab="Month",ylab="Mean of Sales")
As expected, the sales are highest in December, during the holiday season, whereas they vary comparatively little during the other months.
tapply(t$Sales,t$DayOfWeek,mean)
## 1 2 3 4 5 6 7
## 8006.594 6907.936 6593.053 6609.182 6973.977 6040.803 8030.835
Sales are highest on Mondays and Sundays and at a very even level from Tuesday to Friday. Sales are lowest on Saturdays, which might have to do with the fact that some stores close earlier on Saturday. Stores in Germany are generally closed on Sundays and only open on a few days a year, presumably at times when very good sales can be expected, which probably explains the high sales seen on Sundays.
plot(tapply(t$Sales,t$DateDay,mean),xlab="Day of Month",ylab="Mean of Sales")
Most people evidently get their salaries at the end of the month, around the 29th or 30th, sometimes a few days earlier. As expected, sales (and spending) are lowest during the days prior to the salaries being paid. Surprisingly, there is also a smaller hump around the middle of each month. Altogether, the day of the month has a very significant effect on sales.
plot(tapply(t$Sales,t$DateWeek,mean),xlab="Week Number",ylab="Mean of Sales")
At first sight the plot for the week numbers looks extremely noisy. However, what we are seeing here is in fact the rather strong day-of-month-related fluctuation seen directly above, superposed on the relatively flat month-by-month variation. Nevertheless, some noise is surely present because the range of days covered by each week number is different each year. For example, the fifth week of 2013 was Jan 28 to Feb 3, whereas in 2015 it was Jan 26 to Feb 1. These date ranges are associated with different average sales, which appears as random noise in the above plot.
The question remains whether the weekly data, despite looking noisy, are useful for predicting sales. To answer this question, I plot the mean sales by week number for each individual year in the following figure, with the years color-coded:
plot(tapply(t[t$DateYear == 13,]$Sales,t[t$DateYear == 13,]$DateWeek,mean),xlab="Week Number",ylab="Mean of Sales")
points(tapply(t[t$DateYear == 14,]$Sales,t[t$DateYear == 14,]$DateWeek,mean),col="red")
points(tapply(t[t$DateYear == 15,]$Sales,t[t$DateYear == 15,]$DateWeek,mean),col="blue")
legend("top",c("2013","2014","2015"),col=c("black","red","blue"),pch=c(1,1,1))
The results are somewhat surprising. Despite the very noisy appearance of the earlier week number plot, the sales are in fact rather consistent per week number, i.e. the black, blue and red circles are very close to each other for most weeks, e.g. weeks 8 to 11 or 27. The only weeks where we see really large differences between the years are weeks 14 to 17, but these differences are due to the fact that Easter is celebrated in a different week each year. In 2013 (black) it was week 14, in 2014 (red) week 17 and in 2015 (blue) week 15, which are exactly the three conspicuous outlier circles visible in the plot. Thus we can conclude that, surprisingly, week numbers are presumably a rather good predictor variable for sales. It must be kept in mind, however, that the relationship between the week number and the sales is not linear. Unfortunately, for the weeks that must be predicted (from week 32 onward) the gap between the 2013 and 2014 data is often very wide (see especially weeks 32 to 40) compared to the weeks in the first half of the year. Since the task is to predict sales for weeks 31 to 38, this means that week number will be a less reliable predictor than we would hope for exactly the crucial weeks.
Establishing a baseline prediction for this particular data set is relatively straightforward. We can assume that Sales are zero when the store is closed, and we can set the Sales to the mean sales value of non-closed days when it is not closed. Since I have already removed closed days from the data set, I will calculate the baseline for open days only.
The procedure is illustrated with store no. 1:
store <- t[t$Store == 1,]
mean(store$Sales) # mean sales when open
## [1] 4759.096
store$Prediction <- mean(store$Sales)
store$Error <- store$Prediction - store$Sales
The root mean squared error can then be easily calculated:
sqrt(mean((store$Sales-store$Prediction)^2))
## [1] 1011.458
It is worth checking whether the median (which is an equally simple descriptive statistic) possibly offers a better baseline prediction:
store$Prediction <- median(store$Sales)
sqrt(mean((store$Sales-store$Prediction)^2))
## [1] 1017.651
This result indicates that the median is a marginally worse predictor at least for the first store, so I will continue working with the mean.
The errors are calculated in the way described above for all stores:
errors <- rep(0,length(teststores))
means <- aggregate(t$Sales,list(t$Store),mean)
j <- 1
for ( i in teststores) {
errors[j] <- sqrt(mean((t[t$Store == i,]$Sales - means[means$Group.1 == i,2])^2))
j <- j + 1
}
The errors vector contains the root mean squared errors for each individual store.
summary(errors)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 629.9 1356.0 1711.0 1771.0 2057.0 4760.0
sd(errors)
## [1] 571.9886
hist(errors)
Again the error for some stores is extremely high, so these are worth examining.
manyerrors <- cbind(as.numeric(as.character(teststores[errors > 3500])),errors[errors > 3500])
manyerrors
## [,1] [,2]
## [1,] 234 3504.596
## [2,] 251 3545.364
## [3,] 262 4666.399
## [4,] 335 4759.825
## [5,] 586 3537.216
## [6,] 756 3910.047
## [7,] 831 3615.355
## [8,] 842 3785.939
## [9,] 1014 4568.576
## [10,] 1027 3751.706
It appears that the stores with the highest root mean squared error overlap with those that produced the highest sales (we saw 251, 262 and 842 there in particular). This is reasonable, so the data seem to be correct.
Finally, the RMSE for the whole data set is calculated:
t$Prediction <- apply(t[1],1,function(x){means[means$Group.1 == x,2]})
sqrt(mean((t$Sales-t$Prediction)^2))
## [1] 1872.81
The aim is to reduce the RMSE attained through modeling as far as possible below this amount.
I will first experiment with linear regression models. I train a separate model for each store on the complete set of features and assess the accuracy of the predictions of these models using 10-fold cross-validation on each store. There are about 800 days of data for every store, so the folds contain about 80 records each, i.e. the training sets are 720 records long.
storelm <- function(storeNumber) {
store <- t[t$Store == storeNumber,] # a store is selected
shuffledIndices <- sample(nrow(store)) # the data for the store are shuffled
store$Prediction <- 0
z <- nrow(store)
for (i in 1:10) { # 10-fold cross-validation
sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z) # 10 % of all data rows is selected
test <- store[shuffledIndices[sampleIndex],] # it is used as test set
train <- store[shuffledIndices[-sampleIndex],] # the rest is used as training set
modell <- lm(Sales ~ Promo + SchoolHoliday + DayOfWeek + as.factor(DateYear) + as.factor(DateMonth) + as.factor(DateDay) + as.factor(DateWeek), train) # a linear model is fitted to the training set
store[shuffledIndices[sampleIndex],]$Prediction <- predict(modell,test) # predictions are generated for the test set based on the model
}
print(paste("Prediction for 13/6/1: ", store[store$DateYear == 13 & store$DateMonth == 6 & store$DateDay == 1,]$Prediction))
### For some reason that is not clear to me, the linear models generate very inaccurate predictions (usually sales above 100,000) for June 1, 2013 for all stores. I work around this problem by falling back to a less specific prediction for that date: the mean of all predictions generated for the current store.
store[store$DateYear == 13 & store$DateMonth == 6 & store$DateDay == 1,]$Prediction <- mean(store$Prediction)
print(paste("Training error: ", summary.lm(modell)[6])) # this is the training error of the model that was fitted last
sqrt(mean((store$Sales-store$Prediction)^2))
}
paste("RMSE: ", storelm(1))
## [1] "Prediction for 13/6/1: 67093.021139132"
## [1] "Training error: 543.679324084399"
## [1] "RMSE: 601.900441744124"
The RMSE for the linear model (around 600) is much better than the baseline (around 1000) for the first shop. The improvement is similar for the stores that had a very high error for the baseline predictions. In these cases the error is approximately halved, i.e. it goes down from about 4000 to about 2000.
print(system.time(print(paste("RMSE: ", storelm(1014)))))
## [1] "Prediction for 13/6/1: 161079.096915653"
## [1] "Training error: 1761.91561342843"
## [1] "RMSE: 2154.65616847458"
## user system elapsed
## 0.39 0.00 0.39
The only exception that I found is store 262, where the error is about twice as high as earlier. This will need to be addressed.
print(system.time(print(paste("RMSE: ", storelm(262)))))
## [1] "Prediction for 13/6/1: 22061.2693322737"
## [1] "Training error: 2593.43079952038"
## [1] "RMSE: 9983.24496923603"
## user system elapsed
## 0.37 0.00 0.38
I calculate an RMSE score for each of the stores using the cross-validation function defined above, and then I average the RMSE scores, which is a slightly inaccurate way of calculating the RMSE for the whole data set (since not all stores have exactly the same number of open days), but by far the simplest. Before doing this, however, the original table with the training data must be transformed into one where the input variables are treated as categorical and replaced by separate binary (logical) variables for each level. The latter step is necessary because when we try to run the storelm function as defined above, the predict function for the linear model crashes. This is due to a technical reason: for some stores certain values of the DateWeek variable (especially 52) are so sparse that they do not occur at all in some training folds, and since the lm function reduces the number of factor levels automatically if some levels are not instantiated, the factor variables in the training and test folds do not match on some iterations of the cross-validation process. Therefore I have to transform the variables manually, without reducing their levels.
tb <- t[c(1,2)]
tb <- cbind(tb,model.matrix(Sales ~ Promo + StateHoliday + SchoolHoliday + DayOfWeek + as.factor(DateYear) + as.factor(DateMonth) + as.factor(DateDay) + as.factor(DateWeek), data = t))
storelm2 <- function(storeNumber) {
store <- tb[tb$Store == storeNumber,2:ncol(tb)] # the store number factor variable is removed since it is not needed
shuffledIndices <- sample(nrow(store))
store$Prediction <- 0
z <- nrow(store)
for (i in 1:10) {
sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z)
test <- store[shuffledIndices[sampleIndex],]
train <- store[shuffledIndices[-sampleIndex],1:(ncol(store)-1)]
modell <- lm(Sales ~ ., train) # the linear model uses all variables
store[shuffledIndices[sampleIndex],]$Prediction <- predict(modell,test)
}
store[store$Prediction == max(store$Prediction),]$Prediction <- mean(store$Prediction)
sqrt(mean((store$Sales-store$Prediction)^2))
}
The function is checked to make sure it works correctly. The prediction for store 1 is no worse than with the first version of the function, and strangely enough, it consistently fails to reproduce the anomalously high prediction error for store 262 despite doing essentially the same thing. Execution time is about twice as long but still below one second for each store, which is acceptable.
print(system.time(print(storelm2(1))))
## [1] 612.9252
## user system elapsed
## 0.72 0.11 0.83
print(system.time(print(storelm2(262))))
## [1] 1884.843
## user system elapsed
## 0.78 0.12 0.92
The RMSE scores are calculated for all stores. The overall result is clearly lower than the baseline (which was close to 1900).
errors <- rep(0,length(teststores))
j <- 1
for ( i in teststores ) {
errors[j] <- storelm2(i)
j <- j + 1
}
mean(errors)
## [1] 1347.681
Next I use a random forest model to generate predictions. For some reason that is not clear to me, I was not able to run R’s random forest function on the binary variables, but using the original numeric values for the date components worked fine and produced very good results with a tree count of 200. I tried increasing this value up to 500 but saw only a little improvement, while execution time grew significantly, so 200 represents a good compromise between quality and practicality. I abstracted the cross-validation procedure written above for linear regression (storelm) into generic functions (one for the numeric and one for the binary version of the data set) that I can reuse from now on with different models; the modeling function is passed to the numericmodel function as an argument.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
numericmodel <- function(storeNumber,modelType,...) {
store <- t[t$Store == storeNumber,]
store$Promo <- as.numeric(store$Promo)
store$StateHoliday <- as.numeric(store$StateHoliday)
store$SchoolHoliday <- as.numeric(store$SchoolHoliday)
store$DateWeek <- as.numeric(store$DateWeek)
store$DayOfWeek <- as.numeric(store$DayOfWeek)
shuffledIndices <- sample(nrow(store))
store$Prediction <- 0
z <- nrow(store)
for (i in 1:10) {
sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z)
test <- store[shuffledIndices[sampleIndex],]
train <- store[shuffledIndices[-sampleIndex],(2:(ncol(store)-1))]
modell <- modelType(Sales ~ ., train, ...)
store[shuffledIndices[sampleIndex],]$Prediction <- predict(modell,test)
}
sqrt(mean((store$Sales-store$Prediction)^2))
}
Results for stores 1 and 262 are clearly better than those that were attained using linear models. Execution takes a couple of seconds for each shop.
print(system.time(print(numericmodel(1,randomForest,ntree=200))))
## [1] 568.4792
## user system elapsed
## 2.34 0.03 2.39
print(system.time(print(numericmodel(262,randomForest,ntree=200))))
## [1] 1847.454
## user system elapsed
## 3.14 0.08 3.22
This is also executed for the whole data set in order to compare it to the linear model (and baseline) results.
errors <- rep(0,length(teststores))
j <- 1
for ( i in teststores ) {
errors[j] <- numericmodel(i,randomForest,ntree=200)
j <- j + 1
}
mean(errors)
## [1] 909.0821
max(errors)
## [1] 2569.596
The following code is the equivalent function for the binary data set. It produces similar (although slightly worse) results with random forests, but is much slower.
binarymodel <- function(storeNumber,modelType,...) {
store <- tb[tb$Store == storeNumber,2:ncol(tb)]
names(store) <- paste("V",c(1:ncol(store)),sep="") # renaming the column names to V1 through V108 because the current names cannot be used in formulae
names(store)[1] <- "Sales" # the target variable is still called 'Sales'
shuffledIndices <- sample(nrow(store))
store$Prediction <- 0
z <- nrow(store)
for (i in 1:10) {
sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z)
test <- store[shuffledIndices[sampleIndex],]
train <- store[shuffledIndices[-sampleIndex],c(1,3:(ncol(store)-1))]
modell <- modelType(Sales ~ ., train, ...)
store[shuffledIndices[sampleIndex],]$Prediction <- predict(modell,test)
}
sqrt(mean((store$Sales-store$Prediction)^2))
}
Predicting sales for a single store in this way takes about half a minute, so it is not reasonable to run it on the whole data set.
print(system.time(print(binarymodel(1,randomForest,ntree=200))))
## [1] 601.6405
## user system elapsed
## 25.62 0.06 25.68
I also experimented with a different approach to the training data using random forests. So far I have always divided up the training data by store and constructed the models separately for each store. The reasoning behind this procedure is that the store variable is by far the strongest predictor for the sales, which I have not tried to prove but which seems a very reasonable assumption looking at the descriptive statistics of the data set.
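One quick, informal way to gauge this (a sketch, not something used in the modeling itself) is to compare the spread of the per-store mean sales, computed earlier for the baseline, with the overall spread of the daily sales:
sd(means$x) # spread of the per-store mean sales (the means table was computed for the baseline above)
sd(t$Sales) # overall spread of daily sales across all stores and days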
Now I also tried the alternative approach of pooling all shops together and constructing a single model for all of them. The idea here is that changes in sales over time could follow strong general patterns which obviously cannot be established when we look at a single store but which could be captured when modeling the data for all stores in one go. This is clearly a much more resource-intensive solution: computing a model for a single store requires very little memory and is completed in a matter of seconds. Computing one for all stores at once requires gigabytes of memory and takes hours. Unfortunately this approach failed to yield acceptable results. When run on the first 100 stores with a forest size of 200, its execution takes hours but the RMSE for the predictions is slightly higher than the baseline.
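For illustration, a minimal sketch of this pooled approach is given below (this is not the exact code of the experiment; it assumes the columns of t as prepared above and restricts the data to the first 100 test stores). Note that the Store factor has to be converted to a numeric value, because randomForest cannot handle categorical predictors with more than 53 levels.
pooled <- t[t$Store %in% teststores[1:100],c("Store","Sales","Promo","StateHoliday","SchoolHoliday","DateYear","DateMonth","DateDay","DateWeek","DayOfWeek")]
pooled$Store <- as.numeric(as.character(pooled$Store)) # randomForest cannot handle factors with more than 53 levels
pooled$Promo <- as.numeric(pooled$Promo)
pooled$SchoolHoliday <- as.numeric(pooled$SchoolHoliday)
pooled$DayOfWeek <- as.numeric(pooled$DayOfWeek)
pooledForest <- randomForest(Sales ~ ., pooled, ntree=200) # very resource-intensive compared to the per-store models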
The neural network model in R functions differently from the previous two models, so the earlier functions must be modified. I am using the binary version of the data set to train neural network models, and the target variable must be scaled to a value between -1 and 1 for the network to function correctly. Since the binary values are already either 0 or 1, no transformation is required for them. The neuralnet function does not allow the use of the shorthand notation for the formula to be fitted, which is usually expressed as “Sales ~ .”, meaning use all variables to predict sales. Instead, all variables must be explicitly enumerated in the right hand side of the formula, which is achieved by concatenating the column names.
A loop is executed to try different parameter settings for the neural network model function. I am using a single hidden layer, looping over different settings for the number of neurons in this layer between 1 and 15. A second loop is executed within this loop that tries different values for the threshold value that is the stopping criterion for the fitting process. The smaller this value, the longer the fitting runs and the closer the model fits the training data. I have also experimented with larger figures for the number of neurons and smaller values for the thresholds, but these did not improve the RMSE score. As seen in the output for the below process, more neurons and lower thresholds gradually worsen the accuracy of the predictions, which is, I suppose, due to overfitting.
The below code uses the default algorithm for fitting neural networks with the neuralnet function, which is called resilient backpropagation with weight backtracking. I have also experimented with the other algorithms available, but these do not yield significantly different results. The results around 600 that were achieved with 1 neuron for the first store represent a perfectly respectable level of accuracy, about as good as the linear regression model and not much worse than the random forest. However, the neuralnet model also has a very significant drawback: fitting a network just to one store with a threshold of 0.1 (which usually yields acceptable accuracy) takes several seconds, whereas linear models and random forests are fitted almost instantly. This means that fitting models for all stores can take about 2 hours. Lowering the threshold to 0.01 seems to improve accuracy very slightly, but takes about 5 times as long to fit, and is thus not practicable.
library(neuralnet)
## Loading required package: grid
## Loading required package: MASS
storenn <- function(storeNumber,...) {
store <- tb[tb$Store == storeNumber,2:ncol(tb)]
names(store) <- paste("V",c(1:108),sep="") # renaming the column names to V1 through V108 because the current names cannot be used in formulae
names(store)[1] <- "Sales" # the target variable is still called 'Sales'
for (hid in c(1,3)) { # This specifies the number of neurons in the hidden layer
for (trs in c(0.5,0.3,0.1,0.03)) { # This is the threshold value for the partial derivatives of the error function used as stopping criteria
shuffledData <- store[sample(nrow(store)),]
shuffledData$Prediction <- 0
### Here the sales data are scaled to values between -1 and 1
m1 <- mean(shuffledData$Sales)
shuffledData$Sales <- shuffledData$Sales - m1
m2 <- max(abs(shuffledData$Sales))
shuffledData$Sales <- shuffledData$Sales / m2
z <- nrow(store)
for (i in 1:10) {
ind <- 1:z
sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
test <- shuffledData[sampleIndex,3:(ncol(shuffledData)-1)]
train <- shuffledData[-sampleIndex,c(1,3:(ncol(shuffledData)-1))]
n <- names(train)
f <- as.formula(paste("Sales ~", paste(n[n != "Sales"], collapse = " + ")))
modell <- neuralnet(f, train, hidden=hid,thresh=trs)
c <- compute(modell,test)
c$net.result <- c$net.result * m2 + m1
shuffledData[sampleIndex,]$Prediction <- c$net.result
}
shuffledData$Sales <- shuffledData$Sales * m2 + m1
print(system.time(print(paste("hidden=",hid,"thresh=",trs,"RMSE=",sqrt(mean((shuffledData$Sales-shuffledData$Prediction)^2))))))
}
}
}
storenn(1)
## [1] "hidden= 1 thresh= 0.5 RMSE= 631.150302011459"
## user system elapsed
## 0 0 0
## [1] "hidden= 1 thresh= 0.3 RMSE= 615.010357425616"
## user system elapsed
## 0 0 0
## [1] "hidden= 1 thresh= 0.1 RMSE= 653.620989227642"
## user system elapsed
## 0 0 0
## [1] "hidden= 1 thresh= 0.03 RMSE= 623.056264140328"
## user system elapsed
## 0 0 0
## [1] "hidden= 3 thresh= 0.5 RMSE= 666.2249655126"
## user system elapsed
## 0 0 0
## [1] "hidden= 3 thresh= 0.3 RMSE= 652.099525460725"
## user system elapsed
## 0 0 0
## [1] "hidden= 3 thresh= 0.1 RMSE= 632.071316702847"
## user system elapsed
## 0 0 0
## [1] "hidden= 3 thresh= 0.03 RMSE= 748.206699380426"
## user system elapsed
## 0 0 0
The total RMSE using neural networks for the whole data set is calculated with the following code. As expected, the results are not particularly satisfactory.
storenn2 <- function(storeNumber,...) {
store <- tb[tb$Store == storeNumber,2:ncol(tb)]
names(store) <- paste("V",c(1:108),sep="")
names(store)[1] <- "Sales"
shuffledData <- store[sample(nrow(store)),]
shuffledData$Prediction <- 0
m1 <- mean(shuffledData$Sales)
shuffledData$Sales <- shuffledData$Sales - m1
m2 <- max(abs(shuffledData$Sales))
shuffledData$Sales <- shuffledData$Sales / m2
z <- nrow(store)
for (i in 1:10) {
ind <- 1:z
sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
test <- shuffledData[sampleIndex,3:(ncol(shuffledData)-1)]
train <- shuffledData[-sampleIndex,c(1,3:(ncol(shuffledData)-1))]
n <- names(train)
f <- as.formula(paste("Sales ~", paste(n[n != "Sales"], collapse = " + ")))
modell <- neuralnet(f, train, ...)
c <- compute(modell,test)
c$net.result <- c$net.result * m2 + m1
shuffledData[sampleIndex,]$Prediction <- c$net.result
}
shuffledData$Sales <- shuffledData$Sales * m2 + m1
sqrt(mean((shuffledData$Sales-shuffledData$Prediction)^2))
}
errors <- rep(0,length(teststores))
j <- 1
for ( i in teststores ) {
errors[j] <- storenn2(i, hidden=1,thresh=0.1)
j <- j + 1
}
mean(errors)
max(errors)
R’s support vector machine implementation fitted to the binary data delivers results that are about as accurate, using the best combination of parameters (in the below case, cost = 30 and gamma = 0.001), as those generated by random forests. However, it runs much slower and takes roughly 5 to 10 seconds to converge.
library(e1071)
for (cst in c(30000,3000,300,30)) {
for (gm in c(0.1,0.01,0.001)) {
print(system.time(print(paste("cost=",cst,"gamma=",gm,"RMSE=",binarymodel(262,svm,cost=cst,gamma=gm)))))
}
}
## [1] "cost= 30000 gamma= 0.1 RMSE= 4635.56941656794"
## user system elapsed
## 8.24 0.08 8.33
## [1] "cost= 30000 gamma= 0.01 RMSE= 2329.39631254082"
## user system elapsed
## 7.60 0.06 7.69
## [1] "cost= 30000 gamma= 0.001 RMSE= 2008.14839687849"
## user system elapsed
## 12.06 0.06 12.12
## [1] "cost= 3000 gamma= 0.1 RMSE= 4639.68953226739"
## user system elapsed
## 8.27 0.04 8.30
## [1] "cost= 3000 gamma= 0.01 RMSE= 2330.94323499704"
## user system elapsed
## 7.53 0.07 7.61
## [1] "cost= 3000 gamma= 0.001 RMSE= 1952.40151632215"
## user system elapsed
## 11.93 0.03 11.98
## [1] "cost= 300 gamma= 0.1 RMSE= 4640.94882845657"
## user system elapsed
## 8.30 0.10 8.39
## [1] "cost= 300 gamma= 0.01 RMSE= 2337.58645068642"
## user system elapsed
## 7.62 0.01 7.63
## [1] "cost= 300 gamma= 0.001 RMSE= 1882.74394288866"
## user system elapsed
## 11.29 0.10 11.39
## [1] "cost= 30 gamma= 0.1 RMSE= 4633.19788001548"
## user system elapsed
## 8.26 0.08 8.35
## [1] "cost= 30 gamma= 0.01 RMSE= 2366.35393267501"
## user system elapsed
## 7.55 0.04 7.61
## [1] "cost= 30 gamma= 0.001 RMSE= 1827.67286903152"
## user system elapsed
## 7.78 0.11 7.89
This function is used in loops to establish the best combination of parameters. Unfortunately it turns out that the parameters that are ideal for some stores only yield average quality predictions for others and vice versa. For example, the below examples show that a gamma of 0.001 together with a C value between 10 and 100 results in the best predictions for store 262 (a store with high sales), whereas store 1 works best with C = 10000 and gamma = 0.03. Thus using these SVMs would probably require determining separate C and gamma values for each store and constructing the models for each store based on these different parameters (a sketch of what such per-store tuning might look like is given after the parameter scans below). However, the quality of the SVM-based predictions is not high enough to make this worthwhile. Furthermore, training for some stores is rather slow.
for (cst in c(10000,3000,1000,300,100,30,10)) {
for (gm in c(0.03,0.01,0.003,0.001)) {
print(system.time(print(paste("cost=",cst,"gamma=",gm,"RMSE=",binarymodel(262,svm,cost=cst,gamma=gm)))))
}
}
## [1] "cost= 10000 gamma= 0.03 RMSE= 4009.79422759253"
## user system elapsed
## 8.33 0.06 8.39
## [1] "cost= 10000 gamma= 0.01 RMSE= 2324.08524059366"
## user system elapsed
## 7.57 0.09 7.68
## [1] "cost= 10000 gamma= 0.003 RMSE= 1982.21652125073"
## user system elapsed
## 8.38 0.00 8.40
## [1] "cost= 10000 gamma= 0.001 RMSE= 1938.88806809072"
## user system elapsed
## 12.09 0.05 12.14
## [1] "cost= 3000 gamma= 0.03 RMSE= 3988.96976475929"
## user system elapsed
## 8.22 0.06 8.28
## [1] "cost= 3000 gamma= 0.01 RMSE= 2313.02805962884"
## user system elapsed
## 7.60 0.05 7.66
## [1] "cost= 3000 gamma= 0.003 RMSE= 1951.02806194563"
## user system elapsed
## 8.33 0.08 8.44
## [1] "cost= 3000 gamma= 0.001 RMSE= 1997.87421755011"
## user system elapsed
## 12.15 0.06 12.28
## [1] "cost= 1000 gamma= 0.03 RMSE= 4014.93475227269"
## user system elapsed
## 8.22 0.08 8.30
## [1] "cost= 1000 gamma= 0.01 RMSE= 2311.3030130797"
## user system elapsed
## 7.47 0.11 7.58
## [1] "cost= 1000 gamma= 0.003 RMSE= 1952.12609018553"
## user system elapsed
## 8.32 0.07 8.39
## [1] "cost= 1000 gamma= 0.001 RMSE= 1944.82550648585"
## user system elapsed
## 12.03 0.08 12.11
## [1] "cost= 300 gamma= 0.03 RMSE= 3999.97142692386"
## user system elapsed
## 8.22 0.03 8.27
## [1] "cost= 300 gamma= 0.01 RMSE= 2266.4608804984"
## user system elapsed
## 7.59 0.02 7.61
## [1] "cost= 300 gamma= 0.003 RMSE= 1918.97556971088"
## user system elapsed
## 8.27 0.06 8.33
## [1] "cost= 300 gamma= 0.001 RMSE= 1987.77042418125"
## user system elapsed
## 11.39 0.06 11.45
## [1] "cost= 100 gamma= 0.03 RMSE= 3975.40898905443"
## user system elapsed
## 8.19 0.08 8.26
## [1] "cost= 100 gamma= 0.01 RMSE= 2363.30100177786"
## user system elapsed
## 7.52 0.08 7.61
## [1] "cost= 100 gamma= 0.003 RMSE= 1953.59952188883"
## user system elapsed
## 8.31 0.11 8.42
## [1] "cost= 100 gamma= 0.001 RMSE= 1879.98943384919"
## user system elapsed
## 9.78 0.05 9.83
## [1] "cost= 30 gamma= 0.03 RMSE= 4027.71974430893"
## user system elapsed
## 8.24 0.07 8.32
## [1] "cost= 30 gamma= 0.01 RMSE= 2338.60640939273"
## user system elapsed
## 7.60 0.04 7.63
## [1] "cost= 30 gamma= 0.003 RMSE= 1959.45634773481"
## user system elapsed
## 8.22 0.07 8.31
## [1] "cost= 30 gamma= 0.001 RMSE= 1808.49837042223"
## user system elapsed
## 7.84 0.03 7.90
## [1] "cost= 10 gamma= 0.03 RMSE= 4032.16279240566"
## user system elapsed
## 8.22 0.08 8.30
## [1] "cost= 10 gamma= 0.01 RMSE= 2322.41934054244"
## user system elapsed
## 7.60 0.08 7.68
## [1] "cost= 10 gamma= 0.003 RMSE= 1908.31459364646"
## user system elapsed
## 7.63 0.05 7.70
## [1] "cost= 10 gamma= 0.001 RMSE= 1827.86932830604"
## user system elapsed
## 7.15 0.04 7.19
for (cst in c(10000,3000,1000,300,100,30,10)) {
for (gm in c(0.03,0.01,0.003,0.001)) {
print(system.time(print(paste("cost=",cst,"gamma=",gm,"RMSE=",binarymodel(1,svm,cost=cst,gamma=gm)))))
}
}
## [1] "cost= 10000 gamma= 0.03 RMSE= 581.142879580527"
## user system elapsed
## 4.48 0.02 4.49
## [1] "cost= 10000 gamma= 0.01 RMSE= 604.978584459808"
## user system elapsed
## 3.26 0.03 3.29
## [1] "cost= 10000 gamma= 0.003 RMSE= 635.546530810994"
## user system elapsed
## 3.06 0.05 3.10
## [1] "cost= 10000 gamma= 0.001 RMSE= 702.374448471912"
## user system elapsed
## 2.95 0.03 2.98
## [1] "cost= 3000 gamma= 0.03 RMSE= 590.872000276642"
## user system elapsed
## 3.22 0.03 3.26
## [1] "cost= 3000 gamma= 0.01 RMSE= 641.037405812469"
## user system elapsed
## 3.01 0.09 3.11
## [1] "cost= 3000 gamma= 0.003 RMSE= 718.242021226803"
## user system elapsed
## 2.90 0.03 2.93
## [1] "cost= 3000 gamma= 0.001 RMSE= 828.126905711493"
## user system elapsed
## 2.82 0.05 2.88
## [1] "cost= 1000 gamma= 0.03 RMSE= 643.119698670216"
## user system elapsed
## 3.04 0.05 3.09
## [1] "cost= 1000 gamma= 0.01 RMSE= 710.254417758907"
## user system elapsed
## 2.93 0.05 2.98
## [1] "cost= 1000 gamma= 0.003 RMSE= 833.632354371629"
## user system elapsed
## 2.84 0.01 2.86
## [1] "cost= 1000 gamma= 0.001 RMSE= 930.955592008881"
## user system elapsed
## 2.85 0.02 2.87
## [1] "cost= 300 gamma= 0.03 RMSE= 731.563947877744"
## user system elapsed
## 2.90 0.01 2.92
## [1] "cost= 300 gamma= 0.01 RMSE= 837.389158289238"
## user system elapsed
## 2.84 0.02 2.86
## [1] "cost= 300 gamma= 0.003 RMSE= 940.627565613757"
## user system elapsed
## 2.85 0.01 2.87
## [1] "cost= 300 gamma= 0.001 RMSE= 990.310851555043"
## user system elapsed
## 2.82 0.03 2.86
## [1] "cost= 100 gamma= 0.03 RMSE= 852.952630309809"
## user system elapsed
## 2.83 0.02 2.84
## [1] "cost= 100 gamma= 0.01 RMSE= 935.742781446738"
## user system elapsed
## 2.81 0.03 2.84
## [1] "cost= 100 gamma= 0.003 RMSE= 990.364799384578"
## user system elapsed
## 2.86 0.00 2.85
## [1] "cost= 100 gamma= 0.001 RMSE= 1010.5041221811"
## user system elapsed
## 2.78 0.08 2.86
## [1] "cost= 30 gamma= 0.03 RMSE= 954.632985113189"
## user system elapsed
## 2.83 0.03 2.86
## [1] "cost= 30 gamma= 0.01 RMSE= 992.010400651147"
## user system elapsed
## 2.81 0.05 2.85
## [1] "cost= 30 gamma= 0.003 RMSE= 1009.40403678135"
## user system elapsed
## 2.81 0.04 2.85
## [1] "cost= 30 gamma= 0.001 RMSE= 1016.47614714052"
## user system elapsed
## 2.81 0.04 2.84
## [1] "cost= 10 gamma= 0.03 RMSE= 995.431099321153"
## user system elapsed
## 2.84 0.01 2.86
## [1] "cost= 10 gamma= 0.01 RMSE= 1009.68111754956"
## user system elapsed
## 2.77 0.06 2.84
## [1] "cost= 10 gamma= 0.003 RMSE= 1015.92927942376"
## user system elapsed
## 2.80 0.03 2.84
## [1] "cost= 10 gamma= 0.001 RMSE= 1018.15585025683"
## user system elapsed
## 2.80 0.04 2.84
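If per-store tuning were pursued despite these reservations, e1071's built-in grid search could be used. The following is only a sketch of what such a step might look like for a single store; the parameter grids are chosen to mirror the loops above.
store <- tb[tb$Store == 262,2:ncol(tb)]
names(store) <- paste("V",c(1:ncol(store)),sep="")
names(store)[1] <- "Sales"
store <- store[,-2] # drop the constant intercept column
tuned <- tune.svm(Sales ~ ., data=store, gamma=c(0.01,0.003,0.001), cost=c(10,30,100)) # 10-fold cross-validated grid search
tuned$best.parameters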
Although the SVM algorithm does not converge on the numeric version of the training data, I have experimented with a scaled version of these data using the code below.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
scaledmodel <- function(storeNumber,modelType,...) {
store <- t[t$Store == storeNumber,2:10]
shuffledData <- store[sample(nrow(store)),]
shuffledData$Prediction <- 0
m1 <- mean(shuffledData$Sales)
shuffledData$Sales <- shuffledData$Sales - m1
m2 <- max(abs(shuffledData$Sales))
shuffledData$Sales <- shuffledData$Sales / m2
pp <- preProcess(store[-1])
shuffledData <- predict(pp,shuffledData)
z <- nrow(store)
for (i in 1:10) {
ind <- 1:z
sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
test <- shuffledData[sampleIndex,3:(ncol(shuffledData)-1)]
train <- shuffledData[-sampleIndex,c(1,3:(ncol(shuffledData)-1))]
n <- names(train)
f <- as.formula(paste("Sales ~", paste(n[n != "Sales"], collapse = " + ")))
modell <- modelType(f, train, ...)
shuffledData[sampleIndex,]$Prediction <- predict(modell,test) * m2 + m1
}
shuffledData$Sales <- shuffledData$Sales * m2 + m1
sqrt(mean((shuffledData$Sales-shuffledData$Prediction)^2))
}
print(system.time(print(scaledmodel(1,svm))))
## [1] 891.551692
## user system elapsed
## 1.84 0.02 1.86
This does in fact converge and even helped to reduce the execution time for SVMs to around 3 seconds, but the predictions are much less accurate than with the binary data. I have used smaller C values in this case than with the binary model because when I tried values as high as 10000, the algorithm took extremely long (as long as 8 minutes depending on the gamma value) to converge while yielding no better results.
for (cst in c(300,100,30,10,1)) {
for (gm in c(0.1,0.03,0.01,0.003,0.001)) {
print(system.time(print(paste("cost=",cst,"gamma=",gm,"RMSE=",scaledmodel(262,svm,cost=cst,gamma=gm)))))
}
}
## [1] "cost= 300 gamma= 0.1 RMSE= 2593.64232101338"
## user system elapsed
## 32.61 0.00 32.60
## [1] "cost= 300 gamma= 0.03 RMSE= 2296.93245132956"
## user system elapsed
## 10.96 0.02 10.99
## [1] "cost= 300 gamma= 0.01 RMSE= 2340.51431946316"
## user system elapsed
## 5.07 0.02 5.09
## [1] "cost= 300 gamma= 0.003 RMSE= 2450.27204333763"
## user system elapsed
## 3.23 0.00 3.23
## [1] "cost= 300 gamma= 0.001 RMSE= 2514.59037761978"
## user system elapsed
## 2.68 0.00 2.70
## [1] "cost= 100 gamma= 0.1 RMSE= 2475.16858490888"
## user system elapsed
## 11.50 0.01 11.51
## [1] "cost= 100 gamma= 0.03 RMSE= 2280.36807263334"
## user system elapsed
## 5.32 0.02 5.33
## [1] "cost= 100 gamma= 0.01 RMSE= 2347.97891652388"
## user system elapsed
## 3.38 0.00 3.39
## [1] "cost= 100 gamma= 0.003 RMSE= 2483.69624430045"
## user system elapsed
## 2.64 0.00 2.63
## [1] "cost= 100 gamma= 0.001 RMSE= 2548.64205748948"
## user system elapsed
## 2.61 0.00 2.61
## [1] "cost= 30 gamma= 0.1 RMSE= 2316.25440194878"
## user system elapsed
## 5.12 0.00 5.14
## [1] "cost= 30 gamma= 0.03 RMSE= 2264.54892406434"
## user system elapsed
## 3.23 0.00 3.23
## [1] "cost= 30 gamma= 0.01 RMSE= 2407.45275042331"
## user system elapsed
## 2.66 0.00 2.66
## [1] "cost= 30 gamma= 0.003 RMSE= 2510.30074596213"
## user system elapsed
## 2.50 0.01 2.52
## [1] "cost= 30 gamma= 0.001 RMSE= 2805.74003619912"
## user system elapsed
## 2.45 0.00 2.46
## [1] "cost= 10 gamma= 0.1 RMSE= 2360.49558725221"
## user system elapsed
## 3.23 0.03 3.28
## [1] "cost= 10 gamma= 0.03 RMSE= 2306.94297156483"
## user system elapsed
## 2.67 0.04 2.70
## [1] "cost= 10 gamma= 0.01 RMSE= 2495.09428331857"
## user system elapsed
## 2.51 0.00 2.51
## [1] "cost= 10 gamma= 0.003 RMSE= 2756.7848053233"
## user system elapsed
## 2.45 0.00 2.45
## [1] "cost= 10 gamma= 0.001 RMSE= 3121.15195949112"
## user system elapsed
## 2.36 0.00 2.35
## [1] "cost= 1 gamma= 0.1 RMSE= 2404.14928086344"
## user system elapsed
## 2.53 0.00 2.53
## [1] "cost= 1 gamma= 0.03 RMSE= 2727.48736148838"
## user system elapsed
## 2.39 0.00 2.38
## [1] "cost= 1 gamma= 0.01 RMSE= 3104.95917566333"
## user system elapsed
## 2.32 0.00 2.33
## [1] "cost= 1 gamma= 0.003 RMSE= 3846.94688442779"
## user system elapsed
## 2.36 0.00 2.36
## [1] "cost= 1 gamma= 0.001 RMSE= 4404.55866863145"
## user system elapsed
## 2.37 0.01 2.39
Therefore I will only use the basic binary model function to train SVMs.
The above models used the default kernel for SVMs in R, which is the radial basis kernel. I have also experimented with the other available kernels. Reasonably good predictions are generated using linear kernels when C is set relatively low (to 1 or below), but execution time becomes prohibitively long as C approaches 1 or goes beyond it.
for (cst in c(1,0.5,0.1,0.05,0.01,0.003,0.001)) {
print(system.time(print(paste("cost=",cst,"RMSE=",binarymodel(262,svm,cost=cst,kernel="linear")))))
}
## [1] "cost= 1 RMSE= 1844.65110177669"
## user system elapsed
## 41.03 0.13 41.15
## [1] "cost= 0.5 RMSE= 1862.54272709356"
## user system elapsed
## 23.26 0.06 23.33
## [1] "cost= 0.1 RMSE= 1850.25702287523"
## user system elapsed
## 9.02 0.08 9.11
## [1] "cost= 0.05 RMSE= 1839.61052805445"
## user system elapsed
## 7.13 0.04 7.22
## [1] "cost= 0.01 RMSE= 1854.77693246435"
## user system elapsed
## 5.72 0.05 5.77
## [1] "cost= 0.003 RMSE= 2184.88577928492"
## user system elapsed
## 5.60 0.06 5.69
## [1] "cost= 0.001 RMSE= 3112.03226993571"
## user system elapsed
## 5.55 0.05 5.61
For other stores, however, predictions based on linear kernels are worse than average.
for (cst in c(1,0.5,0.1,0.05,0.01,0.003,0.001)) {
print(system.time(print(paste("cost=",cst,"RMSE=",binarymodel(1,svm,cost=cst,kernel="linear")))))
}
## [1] "cost= 1 RMSE= 970.148233347503"
## user system elapsed
## 1.61 0.03 1.66
## [1] "cost= 0.5 RMSE= 995.299056098023"
## user system elapsed
## 1.60 0.05 1.66
## [1] "cost= 0.1 RMSE= 1014.28182471904"
## user system elapsed
## 1.59 0.06 1.66
## [1] "cost= 0.05 RMSE= 1017.41535968184"
## user system elapsed
## 1.61 0.02 1.62
## [1] "cost= 0.01 RMSE= 1019.64941504274"
## user system elapsed
## 1.59 0.06 1.67
## [1] "cost= 0.003 RMSE= 1021.03898011506"
## user system elapsed
## 1.56 0.08 1.64
## [1] "cost= 0.001 RMSE= 1019.74808959326"
## user system elapsed
## 1.59 0.04 1.63
I have not been able to achieve competitive results with polynomial kernels under any settings, so I will not discuss them here.
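For completeness, such an experiment would simply pass the corresponding kernel and its parameters to the same cross-validation function, for example (a sketch; the parameter values are illustrative only):
binarymodel(1,svm,kernel="polynomial",degree=2,cost=10)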
As I mentioned when presenting an overview of the data, there is a second data table called store.csv which contains information on the type of store, whether there is a competition store close to it, and whether the store participates in a promotion programme. This is a typical row from that table.
s[1,]
## Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth
## 1 1 c a 1270 9
## CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear
## 1 2008 FALSE NA NA
## PromoInterval
## 1
I assume that certain pieces of information included here are redundant and less specific than the data available from the training table. No doubt the StoreType, CompetitionDistance and Assortment variables have an effect on the sales of a store. Theoretically, these could be used to impute sales for stores that we have little historical data on, but this will not be necessary since we seem to have sufficient historical data for all stores. The opening date of the competition store and the start of the second type of promo might have some effect on sales. In order to get an impression of whether this effect indeed exists, I create new logical variables for the existence of a competition store and for participation in the second promotion respectively, and set them to TRUE for all dates after the starting dates indicated in the stores table.
t$Competition <- F
t$Promo2 <- F
for (i in 1:nrow(s)) {
if (!is.na(s[i,5])) { # column 5: CompetitionOpenSinceMonth, column 6: CompetitionOpenSinceYear
m <- s[i,5]
y <- s[i,6] - 2000
t[t$Store == i,]$Competition <- t[t$Store == i,]$DateYear > y | (t[t$Store == i,]$DateMonth >= m & t[t$Store == i,]$DateYear == y)
}
if (s[i,7]) { # column 7: Promo2, column 8: Promo2SinceWeek, column 9: Promo2SinceYear
w <- s[i,8]
y <- s[i,9] - 2000
t[t$Store == i,]$Promo2 <- t[t$Store == i,]$DateYear > y | (t[t$Store == i,]$DateWeek >= w & t[t$Store == i,]$DateYear == y)
}
}
Next I check how these two variables affect sales.
t.test(t[t$Promo2,]$Sales,t[!t$Promo2,]$Sales)
##
## Welch Two Sample t-test
##
## data: t[t$Promo2, ]$Sales and t[!t$Promo2, ]$Sales
## t = -81.052822, df = 633758.63, p-value < 0.00000000000000022204
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -605.1260330 -576.5514605
## sample estimates:
## mean of x mean of y
## 6548.138444 7138.977191
Unexpectedly, participation of the store in the second type of promotion is correlated with markedly lower sales (by about 8 per cent). The difference is highly significant.
t.test(t[t$Competition,]$Sales,t[!t$Competition,]$Sales)
##
## Welch Two Sample t-test
##
## data: t[t$Competition, ]$Sales and t[!t$Competition, ]$Sales
## t = -24.192057, df = 597683.09, p-value < 0.00000000000000022204
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -195.0011589 -165.7723695
## sample estimates:
## mean of x mean of y
## 6774.203593 6954.590358
Of our binary variables, Competition correlates least strongly with differences in sales (the t value is closest to 0 here). Although the effect of sales being lower in the presence of competition is statistically highly significant, the magnitude of the difference is small (about 2.5 per cent).
Having added these further variables to our training data, I test the predictions for the usual two stores using both the linear model and the random forest functions.
storelm2(1)
## [1] 602.7654652
storelm2(262)
## [1] 2863.729623
numericmodel(1,randomForest,ntree=200)
## [1] 649.189655
numericmodel(262,randomForest,ntree=200)
## [1] 2081.333607
I retest the predictions for all stores using random forests. Since the cross-validation folds are shuffled randomly, individual runs vary, so the comparison with the earlier run is not exact; the average error per store remains well below that of the linear models.
errors <- rep(0,length(teststores))
j <- 1
for ( i in teststores ) {
errors[j] <- numericmodel(i,randomForest,ntree=200)
j <- j + 1
}
mean(errors)
## [1] 1011.349904
max(errors)
## [1] 2749.477975
Since these two further variables could still prove useful, I include them in the binary version of the data as well.
tb$Promo2TRUE <- as.numeric(t$Promo2)
tb$CompetitionTRUE <- as.numeric(t$Competition)
It is worth exploring whether PCA could improve accuracy or at least processing speed. Let us first examine the contribution that the principal components make to the data, using the binary version of the data set, which has 108 predictive variables; the rank of the data matrix is 107.
store <- tb[tb$Store == 262,4:ncol(tb)]
ncol(store)
## [1] 108
library(Matrix)
qr(store)$rank
## [1] 107
Above I have already selected a particular store (262) for the PCA. First I plot the standard deviations that the ordered principal components account for in the data.
pc <- prcomp(store)
plot(pc$sdev,type="l", xlab="Principal Components", ylab="Standard Deviations")
The standard deviations of the first four components are far higher than those of all the others, and they fall sharply until around component 20. From there they remain relatively flat up to about component 80, where they start to drop again.
Second, I compute the cumulative sum of these standard deviations, divide it by their total sum, and plot the result.
plot(cumsum(pc$sdev / sum(pc$sdev)),type="l", xlab="Principal Components", ylab="Cumulative Standard Deviation Ratio")
This plot shows an almost linear curve: the cumulative contribution of the components rises steadily and only flattens out towards the end. There is thus no obvious cut-off that suggests itself for dimensionality reduction, and we are unlikely to reduce the dimensionality of the data, and thereby gain processing speed, without sacrificing accuracy, since essentially all but the last few components add to the cumulative curve. Even though the first 20 or so components should suffice for rough estimates of sales, we should see a gradual but tangible improvement in prediction accuracy all the way up to the last components. To confirm this, I prepare a version of the storelm2 function that I used above to predict sales with a linear model. The new version incorporates PCA with dimension reduction; its second argument is the dimension to which the data are reduced.
lmwithpca <- function(storeNumber,dim) {
  store <- tb[tb$Store == storeNumber,2:ncol(tb)] # the store number factor variable is removed since it is not needed
  shuffledData <- store[sample(nrow(store)),]
  shuffledData$Prediction <- 0
  z <- nrow(store)
  for (i in 1:10) {
    ind <- 1:z
    sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
    test <- shuffledData[sampleIndex,]
    train <- shuffledData[-sampleIndex,1:(ncol(shuffledData)-1)]
    trank <- qr(train[3:ncol(train)])$rank
    pc <- prcomp(train[3:ncol(train)])
    # The dimension is reduced to dim, but if the rank of the training data is lower,
    # it is reduced to that rank instead. This takes care of the rank-deficient data
    # problem that arose in earlier versions of this function.
    pctrain <- cbind(train[1],pc$x)[1:min(trank,dim)]
    pctest <- as.data.frame(predict(pc,test[3:ncol(test)-1]))[1:min(trank,dim)]
    modell <- lm(Sales ~ ., pctrain)
    shuffledData[sampleIndex,]$Prediction <- predict(modell,pctest)
  }
  sqrt(mean((shuffledData$Sales-shuffledData$Prediction)^2))
}
I first calculate a baseline RMSE using the original function without PCA. I then run the new PCA-based function with the dimension set from 9 to 99 in steps of 10 and plot the resulting RMSE scores.
print(system.time(print(baseline <- storelm2(262))))
## [1] 2500.464874
## user system elapsed
## 0.82 0.00 0.82
results <- rep(1,10)
for (i in seq(9,99,10)) {
  print(system.time(print(results[(i+1)/10] <- lmwithpca(262,i))))
}
## [1] 3553.404564
## user system elapsed
## 1.15 0.03 1.20
## [1] 2901.212728
## user system elapsed
## 1.15 0.03 1.20
## [1] 2755.764757
## user system elapsed
## 1.28 0.04 1.36
## [1] 2791.883354
## user system elapsed
## 1.14 0.08 1.22
## [1] 2802.954744
## user system elapsed
## 1.17 0.09 1.26
## [1] 2815.983384
## user system elapsed
## 1.27 0.02 1.29
## [1] 2803.769114
## user system elapsed
## 1.27 0.08 1.34
## [1] 2700.634346
## user system elapsed
## 1.33 0.06 1.40
## [1] 2322.162164
## user system elapsed
## 1.34 0.05 1.38
## [1] 1911.203792
## user system elapsed
## 1.40 0.04 1.46
plot(seq(9,99,10),results)
points(c(9,99),c(baseline,baseline),"l",col="green")
Although the process is randomised and results differ slightly between runs, a typical outcome is a curve whose RMSE starts out very high with 9 principal components, drops sharply at 19, stays roughly level up to 79, and then drops sharply again at 89 and 99. I have also checked values from 99 to 107, but there is no further consistent gain in that range.
What is interesting, however, is the position of the baseline relative to the PCA-based results. The flat part of the curve runs at around the same height as, and usually slightly below, the baseline, whereas the high-dimensional approximations of the data (89 and 99 components) are far better than the baseline.
Therefore I construct an abstract function that first transforms the data with PCA and then trains an arbitrary model on the result to generate predictions for a store.
modelwithpca <- function(storeNumber,modelType,dim=100,...) {
  store <- tb[tb$Store == storeNumber,2:ncol(tb)] # the store number factor variable is removed since it is not needed
  shuffledData <- store[sample(nrow(store)),]
  shuffledData$Prediction <- 0
  z <- nrow(store)
  for (i in 1:10) {
    ind <- 1:z
    sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
    test <- shuffledData[sampleIndex,]
    train <- shuffledData[-sampleIndex,1:(ncol(shuffledData)-1)]
    trank <- qr(train[3:ncol(train)])$rank
    pc <- prcomp(train[3:ncol(train)])
    pctrain <- cbind(train[1],pc$x)[1:min(trank,dim)]
    pctest <- as.data.frame(predict(pc,test[3:ncol(test)-1]))[1:min(trank,dim)]
    modell <- modelType(Sales ~ ., pctrain,...)
    shuffledData[sampleIndex,]$Prediction <- predict(modell,pctest)
  }
  sqrt(mean((shuffledData$Sales-shuffledData$Prediction)^2))
}
Using this function, the errors for the complete range of stores could be calculated as we did above, although with a linear model as the underlying learner the results would be good but not exceptional.
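As a sketch (not executed here), the per-store errors could be collected in the same way as the earlier loop over teststores; any other model function could be substituted for lm.
errors <- sapply(teststores, function(i) modelwithpca(i, lm, dim = 100))
mean(errors)
max(errors)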
The function could in principle be used with random forests, but this does not yield good results: the binary data deliver somewhat worse error scores than the excellent (and so far best) results obtained with random forests on the numeric data. Running the function also takes an extremely long time despite the mediocre results.
print(system.time(print(modelwithpca(262,randomForest,100,ntree=100))))
## [1] 2252.886994
## user system elapsed
## 30.25 0.03 30.73
I have also created a modified version of the PCA-based function that uses the numeric data. Combined with random forests, however, it delivers far worse results than the plain numericmodel function without PCA.
numericwithpca <- function(storeNumber,modelType,dim=10,...) {
  store <- t[t$Store == storeNumber,2:ncol(t)] # the store number factor variable is removed since it is not needed
  store$Promo <- as.numeric(store$Promo)
  store$StateHoliday <- as.numeric(store$StateHoliday)
  store$SchoolHoliday <- as.numeric(store$SchoolHoliday)
  store$DateWeek <- as.numeric(store$DateWeek)
  store$DayOfWeek <- as.numeric(store$DayOfWeek)
  shuffledData <- store[sample(nrow(store)),]
  shuffledData$Prediction <- 0
  z <- nrow(store)
  for (i in 1:10) {
    ind <- 1:z
    sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
    test <- shuffledData[sampleIndex,2:(ncol(shuffledData)-1)]
    train <- shuffledData[-sampleIndex,1:(ncol(shuffledData)-1)]
    trank <- qr(train[2:ncol(train)])$rank
    pc <- prcomp(train[2:ncol(train)])
    pctrain <- cbind(train[1],pc$x)[1:min(trank,dim)]
    pctest <- as.data.frame(predict(pc,test))[1:min(trank,dim)]
    modell <- modelType(Sales ~ ., pctrain, ...) # pass model parameters such as ntree through to the model function
    shuffledData[sampleIndex,]$Prediction <- predict(modell,pctest)
  }
  sqrt(mean((shuffledData$Sales-shuffledData$Prediction)^2))
}
numericwithpca(262,randomForest,ntree=200)
## [1] 2404.263708
numericmodel(262,randomForest,ntree=200)
## [1] 2157.920224
At the same time, transforming the binary data with PCA significantly improves the SVM results compared with those obtained from the binarymodel function.
for (cst in c(300,100,30,10)) {
  for (gm in c(0.003,0.001)) {
    print(system.time(print(paste("cost=",cst,"gamma=",gm,"RMSE=",modelwithpca(1,svm,cost=cst,gamma=gm)))))
  }
}
## [1] "cost= 300 gamma= 0.003 RMSE= 538.169490760043"
## user system elapsed
## 7.52 0.12 7.76
## [1] "cost= 300 gamma= 0.001 RMSE= 553.180007040596"
## user system elapsed
## 10.17 0.09 10.32
## [1] "cost= 100 gamma= 0.003 RMSE= 537.645840475199"
## user system elapsed
## 7.64 0.07 7.85
## [1] "cost= 100 gamma= 0.001 RMSE= 557.760186938175"
## user system elapsed
## 8.44 0.04 8.67
## [1] "cost= 30 gamma= 0.003 RMSE= 537.796729867815"
## user system elapsed
## 7.31 0.03 7.47
## [1] "cost= 30 gamma= 0.001 RMSE= 589.611565644862"
## user system elapsed
## 7.01 0.05 7.14
## [1] "cost= 10 gamma= 0.003 RMSE= 546.482623881485"
## user system elapsed
## 6.67 0.03 6.76
## [1] "cost= 10 gamma= 0.001 RMSE= 610.304247978315"
## user system elapsed
## 6.19 0.00 6.19
for (cst in c(300,100,30,10)) {
  for (gm in c(0.003,0.001)) {
    print(system.time(print(paste("cost=",cst,"gamma=",gm,"RMSE=",modelwithpca(262,svm,cost=cst,gamma=gm)))))
  }
}
## [1] "cost= 300 gamma= 0.003 RMSE= 1852.57624384181"
## user system elapsed
## 8.97 0.10 9.20
## [1] "cost= 300 gamma= 0.001 RMSE= 1826.00936487394"
## user system elapsed
## 12.42 0.06 12.80
## [1] "cost= 100 gamma= 0.003 RMSE= 1906.82882216824"
## user system elapsed
## 9.69 0.02 10.03
## [1] "cost= 100 gamma= 0.001 RMSE= 1803.28632013106"
## user system elapsed
## 10.92 0.03 11.12
## [1] "cost= 30 gamma= 0.003 RMSE= 1877.85988409389"
## user system elapsed
## 9.06 0.02 9.37
## [1] "cost= 30 gamma= 0.001 RMSE= 1827.68605412662"
## user system elapsed
## 9.28 0.04 9.56
## [1] "cost= 10 gamma= 0.003 RMSE= 1811.04605728556"
## user system elapsed
## 8.47 0.02 8.61
## [1] "cost= 10 gamma= 0.001 RMSE= 1786.56345958342"
## user system elapsed
## 8.18 0.04 8.41
We have established that two methods deliver the best results with cross-validation on the training data set: random forests fitted to the numeric data, and SVMs fitted to the PCA-transformed binary data.
However, both procedures involve randomness (the random forest algorithm itself and the shuffling of the cross-validation folds), so one should make sure to generate results that are consistently as good as possible. For this reason I will combine predictions generated with these models. First, I will fit m = 5 random forests to each shop's training data and generate predictions with them; then I will fit n = 5 SVMs to the same data. I will then check the RMSEs of each of these models and establish the best way to combine them (e.g. take the median or the mean of all models, take the model that does best on the training data, etc.).
combineRFForStore <- function(storeNumber,m=5) {
  storen <- t[t$Store == storeNumber,2:ncol(t)]
  storen$Promo <- as.numeric(storen$Promo)
  storen$StateHoliday <- as.numeric(storen$StateHoliday)
  storen$SchoolHoliday <- as.numeric(storen$SchoolHoliday)
  storen$DateWeek <- as.numeric(storen$DateWeek)
  storen$DayOfWeek <- as.numeric(storen$DayOfWeek)
  salesAndPredictions <- storen[1]
  shuffledData <- storen[sample(nrow(storen)),]
  z <- nrow(storen)
  shuffledData$Prediction <- 0
  shuffledData$Residual <- 0
  shuffledData$Error <- 0
  for (k in 1:m) {
    for (i in 1:10) {
      ind <- 1:z
      sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
      test <- shuffledData[sampleIndex,]
      train <- shuffledData[-sampleIndex,(1:(ncol(shuffledData)-3))]
      modell <- randomForest(Sales ~ ., train, ntree=200)
      shuffledData[sampleIndex,]$Prediction <- predict(modell,test)
      shuffledData[sampleIndex,]$Residual <- mean(sqrt(modell$mse))
      shuffledData[sampleIndex,]$Error <- abs(shuffledData[sampleIndex,]$Sales - shuffledData[sampleIndex,]$Prediction)
    }
    salesAndPredictions <- cbind(salesAndPredictions,shuffledData[order(as.numeric(row.names(shuffledData))),(ncol(shuffledData)-2) : ncol(shuffledData)])
    names(salesAndPredictions)[(ncol(salesAndPredictions)-2):ncol(salesAndPredictions)] <- c(paste0("P",k),paste0("R",k),paste0("E",k))
    shuffledData <- shuffledData[sample(nrow(shuffledData)),]
  }
  salesAndPredictions
}
Calling the above function with a store number gives us a table whose columns contain, for each of the 5 cross-validation passes, the predictions, the residual error of the corresponding model, and the prediction error (i.e. the distance between sales and prediction in each row).
sp <- combineRFForStore(1)
head(sp)
## Sales P1 R1 E1 P2 R2 E2
## 1 5263 FALSE 562.5627345 824.73022616 FALSE 555.2929919 455.8934799
## 1116 5020 FALSE 571.2980274 30.53735213 FALSE 571.9360165 105.1318442
## 2231 4782 FALSE 587.5445638 402.31782397 FALSE 571.9360165 681.3112589
## 3346 5011 FALSE 548.7851509 119.38346933 FALSE 580.1206878 234.6015614
## 4461 6102 FALSE 580.5355484 729.79520106 FALSE 585.7268193 763.1129744
## 6691 4364 FALSE 571.2980274 220.77352842 FALSE 580.1206878 203.6818098
## P3 R3 E3 P4 R4 E4 P5
## 1 FALSE 556.9734689 455.56101592 FALSE 585.9847594 592.0242026 FALSE
## 1116 FALSE 556.9734689 80.09322094 FALSE 557.1926099 274.1452736 FALSE
## 2231 FALSE 592.3479034 745.17798335 FALSE 557.1926099 766.1443440 FALSE
## 3346 FALSE 592.3479034 392.48673976 FALSE 557.1926099 466.1901211 FALSE
## 4461 FALSE 594.4332079 774.47792923 FALSE 558.3101120 741.1280745 FALSE
## 6691 FALSE 576.0982377 179.47072841 FALSE 575.8490905 248.3175665 FALSE
## R5 E5
## 1 587.2957511 518.5154671
## 1116 589.5290878 346.5003603
## 2231 587.2957511 608.5533910
## 3346 594.0160907 218.2420557
## 4461 594.0160907 841.2543422
## 6691 594.0160907 137.7047735
First I investigate whether models should be preferred based on their mean residual error on the data they were trained on. To do this, I combine the residual errors of all 50 models into a single column and the prediction errors into another column, square the latter, and then calculate the mean squared prediction error for each residual error value.
errorsVsResiduals <- cbind(c(sp$R1,sp$R2,sp$R3,sp$R4,sp$R5),c(sp$E1,sp$E2,sp$E3,sp$E4,sp$E5))
errorsVsResiduals[,2] <- errorsVsResiduals[,2]^2
ag <- aggregate(errorsVsResiduals[,2],list(errorsVsResiduals[,1]),mean)
ag[1:10,]
## Group.1 x
## 1 533.2287718 831471.1400
## 2 545.8299409 609595.4572
## 3 548.7851509 463936.7054
## 4 553.3918313 330278.6786
## 5 555.2929919 419684.6768
## 6 556.9734689 627015.5655
## 7 557.1926099 523248.6238
## 8 558.3101120 516543.2643
## 9 562.5627345 579948.4461
## 10 564.6136231 391730.6034
plot(ag[,1],sqrt(ag[,2]),xlab="Training Residual Error",ylab="Prediction RMSE")
names(ag) <- c("Resid","MSE")
l <- lm(sqrt(MSE) ~ Resid, ag)
points(c(min(ag$Resid),max(ag$Resid)),predict(l,data.frame(Resid=c(min(ag$Resid),max(ag$Resid)))),type="l",col="green")
I have plotted the prediction root mean squared errors (vertical axis) against the residual errors of the models that generated the corresponding predictions (horizontal axis). There is no clear tendency visible in the raw points: the residual errors are concentrated between 570 and 590, but the corresponding prediction RMSEs are spread fairly evenly between 450 and 650, and very low RMSEs occur with residual errors as low as 575 and as high as 590. I have also fitted a linear model to these data and plotted it (the green line). It shows a clear drop in prediction RMSE as the residual error increases. This seems counterintuitive at first but is not completely unexpected: a lower residual error is likely a sign of overfitting to the training data and therefore correlates with lower accuracy on the test data. If anything, it suggests that preference should be given to models with a relatively high residual error. I have repeated this trial several times, both on the same store and on others, and the pattern was always the same.
As mentioned before, another approach is to investigate how the five predictions generated for each row of the training table can be combined. I try several possible rules: take the mean, the harmonic mean, the geometric mean, the median, the minimum or the maximum of the five predictions, or choose one of them at random. I then calculate RMSE scores for each of the resulting derived predictions.
calculateErrors <- function(sp) {
  sps <- sp[1] # initialize sps with the Sales column from sp
  predCols <- seq(2,ncol(sp)-2,3)  # the P1, P2, ... prediction columns
  residCols <- seq(3,ncol(sp)-1,3) # the R1, R2, ... residual columns
  sps$Mean <- apply(sp[,predCols],1,mean)
  sps$HarmonicMean <- apply(sp[,predCols],1,function(x) {1/mean(1/x)})
  sps$GeometricMean <- apply(sp[,predCols],1,function(x) {prod(x)^(1/length(x))})
  sps$Median <- apply(sp[,predCols],1,median)
  sps$Min <- apply(sp[,predCols],1,min)
  sps$Max <- apply(sp[,predCols],1,max)
  sps$Random <- apply(sp[,predCols],1,function(x) {x[sample(1:length(x),1)]})
  sps$RMax <- apply(sp[,residCols],1,max)
  # index of the prediction column belonging to the model with the worst (highest) residual error
  sps$RMaxIndex <- apply(cbind(sp[,residCols],sps[ncol(sps)]),1,function(x){2+((which(x == x[length(x)])[1])-1)*3})
  sps$WorstResidual <- apply(cbind(sp,sps[ncol(sps)]),1,function(x){x[x[length(x)]]})
  errors <- rep(0,8)
  names(errors) = c("Mean","HarmonicMean","GeometricMean","Median","Min","Max","Random","WorstResidual")
  errors[1] <- sqrt(mean((sps$Sales-sps$Mean)^2))
  errors[2] <- sqrt(mean((sps$Sales-sps$HarmonicMean)^2))
  errors[3] <- sqrt(mean((sps$Sales-sps$GeometricMean)^2))
  errors[4] <- sqrt(mean((sps$Sales-sps$Median)^2))
  errors[5] <- sqrt(mean((sps$Sales-sps$Min)^2))
  errors[6] <- sqrt(mean((sps$Sales-sps$Max)^2))
  errors[7] <- sqrt(mean((sps$Sales-sps$Random)^2))
  errors[8] <- sqrt(mean((sps$Sales-sps$WorstResidual)^2))
  errors
}
calculateErrors(sp)
## Mean HarmonicMean GeometricMean Median Min
## 4865.392356 4865.392356 4865.392356 4865.392356 4865.392356
## Max Random WorstResidual
## 4865.392356 4865.392356 4865.392356
Choosing a random value amounts to generating a single prediction without any combination, so it can serve as a baseline. The three mean variants are usually better than the random choice and roughly equal to one another. Always choosing the minimum or the maximum usually leads to the worst results, the minimum typically being worse than the maximum, although this may depend on the selected store. Choosing the median of the available predictions consistently yields the best result. Choosing the prediction generated by the model with the worst residual error is relatively cumbersome to compute and gives mediocre results: usually better than a random choice but worse than the mean or median. I have repeated this trial for several different stores, and the results are consistently similar.
I also loop over the first ten test stores, with 9 predictions generated for each day, to confirm that the above generalisations hold on a somewhat bigger test set. The conclusion remains that the most accurate predictions are achieved by always selecting the median prediction.
sp <- combineRFForStore(1,9)
for (x in teststores[2:10]) {
  sp <- rbind(sp, combineRFForStore(x,9))
}
calculateErrors(sp)
## Mean HarmonicMean GeometricMean Median Min
## 6844.227165 6844.227165 6844.227165 6844.227165 6844.227165
## Max Random WorstResidual
## 6844.227165 6844.227165 6844.227165
Increasing the number of random forests that are combined is useful, but stops yielding improvements after a point. I loop over several calls of the combineRFForStore function (with store 262), gradually increasing the number of random forests, measuring the time, and collecting the median scores in a vector, which I finally plot.
medians <- rep(0,10)
for (i in seq(3,30,3)) {
  print(system.time(sp <- combineRFForStore(262,i)))
  e <- calculateErrors(sp)
  medians[i/3] <- e["Median"]
}
## user system elapsed
## 15.99 0.13 16.35
## user system elapsed
## 30.58 0.26 31.19
## user system elapsed
## 47.01 0.31 47.88
## user system elapsed
## 61.98 0.38 63.48
## user system elapsed
## 79.57 0.67 81.52
## user system elapsed
## 97.04 0.70 99.78
## user system elapsed
## 112.59 0.79 115.80
## user system elapsed
## 120.84 0.82 121.92
## user system elapsed
## 135.97 0.94 137.11
## user system elapsed
## 152.60 0.88 153.72
plot(seq(3,30,3),medians,xlab="Number of RF-based Predictions", ylab="Median of RMSEs")
The time needed to execute the function rises linearly, as expected: fitting the 10 random forests used in one pass of 10-fold cross-validation takes about 5 seconds, so fitting 30 such sets (300 forests) takes about two and a half minutes. The results are random, but generating only three predictions per day usually gives the highest error for the median prediction, six predictions are usually almost as good as the bigger sets, and there is no clear improvement between 9 and 30. Nine random forest passes therefore seem a good compromise between execution time (relatively fast) and quality (not substantially worse than anything else in the 9 to 30 range).
The code to combine SVM models is presented below:
combineSVMForStore <- function(storeNumber,n=5, c=30) {
  storeb <- tb[tb$Store == storeNumber,2:ncol(tb)]
  salesAndPredictions <- storeb[1]
  shuffledData <- storeb[sample(nrow(storeb)),]
  shuffledData$Prediction <- 0
  shuffledData$Residual <- 0
  shuffledData$Error <- 0
  z <- nrow(storeb)
  for (k in 1:n) {
    for (i in 1:10) {
      ind <- 1:z
      sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
      test <- shuffledData[sampleIndex,]
      train <- shuffledData[-sampleIndex,(1:(ncol(shuffledData)-1))]
      trank <- qr(train[3:ncol(train)])$rank
      pc <- prcomp(train[3:ncol(train)])
      pctrain <- cbind(train[1],pc$x)[1:min(trank,100)]
      pctest <- as.data.frame(predict(pc,test[3:ncol(test)-1]))[1:min(trank,100)]
      modell <- svm(Sales ~ ., pctrain, cost=c,gamma=0.001)
      shuffledData[sampleIndex,]$Prediction <- predict(modell,pctest)
      shuffledData[sampleIndex,]$Residual <- sqrt(mean((modell$residuals)^2))
      shuffledData[sampleIndex,]$Error <- abs(shuffledData[sampleIndex,]$Sales - shuffledData[sampleIndex,]$Prediction)
    }
    salesAndPredictions <- cbind(salesAndPredictions,shuffledData[order(as.numeric(row.names(shuffledData))),(ncol(shuffledData)-2) : ncol(shuffledData)])
    shuffledData <- shuffledData[sample(nrow(shuffledData)),]
    names(salesAndPredictions)[(ncol(salesAndPredictions)-2):ncol(salesAndPredictions)] <- c(paste0("P",k),paste0("R",k),paste0("E",k))
  }
  salesAndPredictions
}
This function is tested on a few stores, comparing the random forest results with the SVM results for the same stores.
for (i in c(1,128,251,262,562)) {
  print(paste("SVM for store",i))
  print(system.time(sp <- combineSVMForStore(i,9)))
  print(calculateErrors(sp))
  print(paste("RF for store",i))
  print(system.time(sp <- combineRFForStore(i,9)))
  print(calculateErrors(sp))
}
## [1] "SVM for store 1"
## user system elapsed
## 58.66 0.14 58.84
## Mean HarmonicMean GeometricMean Median Min
## 547.0483759 544.6061618 545.0969552 542.7071117 624.2674212
## Max Random WorstResidual
## 943.6167381 602.4919114 597.2074990
## [1] "RF for store 1"
## user system elapsed
## 32.09 0.24 32.37
## Mean HarmonicMean GeometricMean Median Min
## 4865.392356 4865.392356 4865.392356 4865.392356 4865.392356
## Max Random WorstResidual
## 4865.392356 4865.392356 4865.392356
## [1] "SVM for store 128"
## user system elapsed
## 57.55 0.06 57.68
## Mean HarmonicMean GeometricMean Median Min
## 985.7672645 946.1393057 958.7544588 943.1256068 1021.5017013
## Max Random WorstResidual
## 2890.6716138 1414.6923838 998.8589337
## [1] "RF for store 128"
## user system elapsed
## 31.86 0.28 32.16
## Mean HarmonicMean GeometricMean Median Min
## 7756.266553 7756.266553 7756.266553 7756.266553 7756.266553
## Max Random WorstResidual
## 7756.266553 7756.266553 7756.266553
## [1] "SVM for store 251"
## user system elapsed
## 56.81 0.09 56.98
## Mean HarmonicMean GeometricMean Median Min
## 1925.688837 1877.580218 1893.743699 1853.206990 2004.101233
## Max Random WorstResidual
## 5285.922911 2758.693444 1917.046215
## [1] "RF for store 251"
## user system elapsed
## 31.82 0.29 32.13
## Mean HarmonicMean GeometricMean Median Min
## 19448.94177 19448.94177 19448.94177 19448.94177 19448.94177
## Max Random WorstResidual
## 19448.94177 19448.94177 19448.94177
## [1] "SVM for store 262"
## user system elapsed
## 74.09 0.09 74.28
## Mean HarmonicMean GeometricMean Median Min
## 1891.580373 1858.773258 1871.792007 1833.505411 1892.246965
## Max Random WorstResidual
## 4296.687053 2294.992818 1865.602810
## [1] "RF for store 262"
## user system elapsed
## 49.56 0.36 50.01
## Mean HarmonicMean GeometricMean Median Min
## 21237.5182 21237.5182 21237.5182 21237.5182 21237.5182
## Max Random WorstResidual
## 21237.5182 21237.5182 21237.5182
## [1] "SVM for store 562"
## user system elapsed
## 78.99 0.13 79.37
## Mean HarmonicMean GeometricMean Median Min
## 1385.510099 1384.470882 1384.847637 1389.680890 1517.816718
## Max Random WorstResidual
## 1735.160049 1463.936827 1406.316113
## [1] "RF for store 562"
## user system elapsed
## 43.48 0.33 43.84
## Mean HarmonicMean GeometricMean Median Min
## 18206.92248 18206.92248 18206.92248 18206.92248 18206.92248
## Max Random WorstResidual
## 18206.92248 18206.92248 18206.92248
Surprisingly, although random forests consistently seemed to deliver superior results to support vector machines when we compared the RMSEs of single predictions, combining several SVM-based predictions drastically reduces the error and produces results that are much better than anything seen so far in many shops (1, 128, 251), whereas the reverse is true in others (262, 562). I have run these trials several times, and the behaviour is consistent within each shop. It therefore appears sensible either to determine per shop which type of model works better and generate predictions with that model, or to combine the SVM and RF predictions.
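A per-shop choice could be operationalised with a small helper along the following lines (a sketch only; chooseModelForStore is hypothetical and not used in the remainder of this paper).
chooseModelForStore <- function(storeNumber, n = 9) {
  # compare the median-combined cross-validation errors of the two approaches
  svmErr <- calculateErrors(combineSVMForStore(storeNumber, n))["Median"]
  rfErr <- calculateErrors(combineRFForStore(storeNumber, n))["Median"]
  if (svmErr < rfErr) "SVM" else "RF"
}
chooseModelForStore(262)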
Unfortunately, SVMs are also slower to fit, taking roughly twice as long. Still, the improvement in accuracy seen in some shops more than makes up for this.
medians <- rep(0,5)
for (i in seq(3,15,3)) {
  print(system.time(sp <- combineSVMForStore(262,i)))
  e <- calculateErrors(sp)
  medians[i/3] <- e["Median"]
}
## user system elapsed
## 24.79 0.06 24.90
## user system elapsed
## 49.50 0.05 49.61
## user system elapsed
## 73.72 0.09 73.98
## user system elapsed
## 98.47 0.14 98.81
## user system elapsed
## 123.13 0.23 123.46
plot(seq(3,15,3),medians,xlab="Number of SVM-based Predictions", ylab="Median of RMSEs")
Increasing the number of SVM-based predictions that are combined is only useful up to around 6; beyond that point there is no further improvement.
Next I experiment with generating, for each row, 9 SVM-based predictions with C = 30, 100 and 1000 respectively and 9 RF-based predictions, and with combining these in various ways.
best <- rep(0,teststores[15])
spsvm <- combineSVMForStore(1,9,30)
print(e1 <- calculateErrors(spsvm))
## Mean HarmonicMean GeometricMean Median Min
## 556.2173335 548.4557681 551.2186891 544.0209649 622.8145605
## Max Random WorstResidual
## 1190.4424576 649.2840112 569.7104359
spsvm <- cbind(rep(1,nrow(spsvm)),spsvm)
names(spsvm)[1] <- "Store"
spsvm2 <- combineSVMForStore(1,9,100)
print(e2 <- calculateErrors(spsvm2))
## Mean HarmonicMean GeometricMean Median Min
## 508.1176035 495.2274721 499.7816944 489.2613421 570.5219942
## Max Random WorstResidual
## 1238.6375621 615.6683465 578.7358995
spsvm3 <- combineSVMForStore(1,9,1000)
print(e3 <- calculateErrors(spsvm3))
## Mean HarmonicMean GeometricMean Median Min
## 481.8217780 480.5393366 480.7184976 478.5506778 601.6078272
## Max Random WorstResidual
## 751.1668609 519.7521619 502.6546366
sprf <- combineRFForStore(1,9)
print(e4 <- calculateErrors(sprf))
## Mean HarmonicMean GeometricMean Median Min
## 4865.392356 4865.392356 4865.392356 4865.392356 4865.392356
## Max Random WorstResidual
## 4865.392356 4865.392356 4865.392356
es <- c(e1[4],e2[4],e3[4],e4[4])
if (e1[4] == min(es)) {
  best[1] <- 1
} else if (e2[4] == min(es)) {
  best[1] <- 2
} else if (e3[4] == min(es)) {
  best[1] <- 3
} else {
  best[1] <- 4
}
for (i in teststores[2:15]) {
  print(paste("SVM(1) for store",i))
  sp <- combineSVMForStore(i,9,30)
  print(e1 <- calculateErrors(sp))
  sp <- cbind(rep(i,nrow(sp)),sp)
  names(sp)[1] <- "Store"
  spsvm <- rbind(spsvm,sp)
  print(paste("SVM(2) for store",i))
  sp <- combineSVMForStore(i,9,100)
  print(e2 <- calculateErrors(sp))
  spsvm2 <- rbind(spsvm2,sp)
  print(paste("SVM(3) for store",i))
  sp <- combineSVMForStore(i,9,1000)
  print(e3 <- calculateErrors(sp))
  spsvm3 <- rbind(spsvm3,sp)
  print(paste("RF for store",i))
  sp <- combineRFForStore(i,9)
  print(e4 <- calculateErrors(sp))
  sprf <- rbind(sprf,sp)
  es <- c(e1[4],e2[4],e3[4],e4[4])
  if (e1[4] == min(es)) {
    best[i] <- 1
  } else if (e2[4] == min(es)) {
    best[i] <- 2
  } else if (e3[4] == min(es)) {
    best[i] <- 3
  } else {
    best[i] <- 4
  }
}
## [1] "SVM(1) for store 3"
## Mean HarmonicMean GeometricMean Median Min
## 886.6921131 878.2345570 880.8309748 875.2120908 972.4512192
## Max Random WorstResidual
## 1582.5811268 975.5574794 933.3558481
## [1] "SVM(2) for store 3"
## Mean HarmonicMean GeometricMean Median Min
## 842.2599582 843.6035004 840.3440793 827.8073063 1076.6447985
## Max Random WorstResidual
## 1577.7096684 1033.9610190 910.8981422
## [1] "SVM(3) for store 3"
## Mean HarmonicMean GeometricMean Median Min
## 831.6306193 880.1541002 846.6431962 821.2512171 1580.8447852
## Max Random WorstResidual
## 984.9484978 963.5207627 895.3741700
## [1] "RF for store 3"
## Mean HarmonicMean GeometricMean Median Min
## 7279.432151 7279.432151 7279.432151 7279.432151 7279.432151
## Max Random WorstResidual
## 7279.432151 7279.432151 7279.432151
## [1] "SVM(1) for store 7"
## Mean HarmonicMean GeometricMean Median Min
## 1071.240280 1079.386513 1074.493088 1063.146507 1452.464972
## Max Random WorstResidual
## 1119.760668 1092.655302 1083.681666
## [1] "SVM(2) for store 7"
## Mean HarmonicMean GeometricMean Median Min
## 1009.9360682 1014.4992647 1011.3200211 996.5205004 1229.0971854
## Max Random WorstResidual
## 1541.5369399 1139.6996573 1048.2345652
## [1] "SVM(3) for store 7"
## Mean HarmonicMean GeometricMean Median Min
## 1091.744389 156922.194464 NaN 1002.387318 3545.710053
## Max Random WorstResidual
## 1307.699712 1436.881639 1093.115542
## [1] "RF for store 7"
## Mean HarmonicMean GeometricMean Median Min
## 9150.327463 9150.327463 9150.327463 9150.327463 9150.327463
## Max Random WorstResidual
## 9150.327463 9150.327463 9150.327463
## [1] "SVM(1) for store 8"
## Mean HarmonicMean GeometricMean Median Min
## 587.7418121 587.3411995 587.5004205 587.9790325 663.3869310
## Max Random WorstResidual
## 697.0941580 620.1082543 625.1968176
## [1] "SVM(2) for store 8"
## Mean HarmonicMean GeometricMean Median Min
## 579.9801216 574.0231715 575.9229349 574.8261956 667.9956183
## Max Random WorstResidual
## 1120.9254396 684.7632993 632.6027797
## [1] "SVM(3) for store 8"
## Mean HarmonicMean GeometricMean Median Min
## 571.6371083 572.4031224 571.3105374 561.4358880 786.3052653
## Max Random WorstResidual
## 804.2303737 628.5468113 607.8182412
## [1] "RF for store 8"
## Mean HarmonicMean GeometricMean Median Min
## 5853.845733 5853.845733 5853.845733 5853.845733 5853.845733
## Max Random WorstResidual
## 5853.845733 5853.845733 5853.845733
## [1] "SVM(1) for store 9"
## Mean HarmonicMean GeometricMean Median Min
## 803.9919502 760.3072135 773.3883674 755.0415969 797.6615622
## Max Random WorstResidual
## 2709.9877874 1187.7048660 802.6902016
## [1] "SVM(2) for store 9"
## Mean HarmonicMean GeometricMean Median Min
## 760.2129518 731.6930501 739.1318359 714.4913989 836.3047866
## Max Random WorstResidual
## 2522.0609758 909.2143319 803.5124436
## [1] "SVM(3) for store 9"
## Mean HarmonicMean GeometricMean Median Min
## 816.2601570 731.5231506 755.4216404 711.8659913 819.1056757
## Max Random WorstResidual
## 3479.5095344 1317.7108174 788.8333216
## [1] "RF for store 9"
## Mean HarmonicMean GeometricMean Median Min
## 6761.167323 6761.167323 6761.167323 6761.167323 6761.167323
## Max Random WorstResidual
## 6761.167323 6761.167323 6761.167323
## [1] "SVM(1) for store 10"
## Mean HarmonicMean GeometricMean Median Min
## 541.1113955 537.5541260 538.8075977 537.3477384 597.2329503
## Max Random WorstResidual
## 1005.8683356 588.8779090 567.2506384
## [1] "SVM(2) for store 10"
## Mean HarmonicMean GeometricMean Median Min
## 522.2504587 517.5846677 519.3817972 519.2821079 608.0763803
## Max Random WorstResidual
## 1033.1234859 633.2199619 582.5228987
## [1] "SVM(3) for store 10"
## Mean HarmonicMean GeometricMean Median Min
## 516.7513337 512.9365700 514.0687868 511.4762694 638.4452964
## Max Random WorstResidual
## 1071.7643012 631.6177377 548.4928346
## [1] "RF for store 10"
## Mean HarmonicMean GeometricMean Median Min
## 5666.637658 5666.637658 5666.637658 5666.637658 5666.637658
## Max Random WorstResidual
## 5666.637658 5666.637658 5666.637658
## [1] "SVM(1) for store 11"
## Mean HarmonicMean GeometricMean Median Min
## 1137.486314 1115.542678 1123.541125 1105.974839 1203.042213
## Max Random WorstResidual
## 2451.065440 1235.981269 1425.388899
## [1] "SVM(2) for store 11"
## Mean HarmonicMean GeometricMean Median Min
## 1081.886595 1050.466724 1059.875718 1051.957339 1236.542612
## Max Random WorstResidual
## 2783.116091 1397.780945 1050.963688
## [1] "SVM(3) for store 11"
## Mean HarmonicMean GeometricMean Median Min
## 1008.180491 2675.181708 NaN 1002.136206 1466.293025
## Max Random WorstResidual
## 1874.013761 1179.120886 1147.447585
## [1] "RF for store 11"
## Mean HarmonicMean GeometricMean Median Min
## 8336.848314 8336.848314 8336.848314 8336.848314 8336.848314
## Max Random WorstResidual
## 8336.848314 8336.848314 8336.848314
## [1] "SVM(1) for store 12"
## Mean HarmonicMean GeometricMean Median Min
## 904.0394359 876.9985210 886.1881400 859.4536710 924.3889094
## Max Random WorstResidual
## 2424.3784360 1249.8088251 921.4300465
## [1] "SVM(2) for store 12"
## Mean HarmonicMean GeometricMean Median Min
## 878.8461784 841.4492324 852.5346182 835.9096094 943.2381004
## Max Random WorstResidual
## 2955.1499747 1430.8224306 950.5872758
## [1] "SVM(3) for store 12"
## Mean HarmonicMean GeometricMean Median Min
## 907.2567421 845.3606695 861.1895857 842.6188198 1005.5352834
## Max Random WorstResidual
## 3541.4955625 1283.4030277 1046.5859341
## [1] "RF for store 12"
## Mean HarmonicMean GeometricMean Median Min
## 7861.834065 7861.834065 7861.834065 7861.834065 7861.834065
## Max Random WorstResidual
## 7861.834065 7861.834065 7861.834065
## [1] "SVM(1) for store 13"
## Mean HarmonicMean GeometricMean Median Min
## 721.7162910 724.2015922 722.6623421 725.4628937 840.2247204
## Max Random WorstResidual
## 929.2918098 761.5470674 809.4916836
## [1] "SVM(2) for store 13"
## Mean HarmonicMean GeometricMean Median Min
## 721.0671970 720.3311338 719.9750514 720.7444773 847.6468035
## Max Random WorstResidual
## 1171.4703679 838.0380973 810.7961776
## [1] "SVM(3) for store 13"
## Mean HarmonicMean GeometricMean Median Min
## 732.0027586 739.2274107 734.5988802 734.0670337 1084.3334400
## Max Random WorstResidual
## 901.5887130 797.6257060 779.4568822
## [1] "RF for store 13"
## Mean HarmonicMean GeometricMean Median Min
## 5286.844562 5286.844562 5286.844562 5286.844562 5286.844562
## Max Random WorstResidual
## 5286.844562 5286.844562 5286.844562
## [1] "SVM(1) for store 14"
## Mean HarmonicMean GeometricMean Median Min
## 603.8128637 594.9630309 598.2647708 594.5538619 663.9923708
## Max Random WorstResidual
## 1204.6962953 691.4602435 622.3655136
## [1] "SVM(2) for store 14"
## Mean HarmonicMean GeometricMean Median Min
## 570.2212718 570.6726597 569.7971573 565.9134143 699.8272429
## Max Random WorstResidual
## 832.8007677 671.9448978 641.7335939
## [1] "SVM(3) for store 14"
## Mean HarmonicMean GeometricMean Median Min
## 568.9376912 570.5371586 569.1897497 561.0756271 727.5160909
## Max Random WorstResidual
## 840.5882147 657.2871851 757.3678156
## [1] "RF for store 14"
## Mean HarmonicMean GeometricMean Median Min
## 5702.6173 5702.6173 5702.6173 5702.6173 5702.6173
## Max Random WorstResidual
## 5702.6173 5702.6173 5702.6173
## [1] "SVM(1) for store 15"
## Mean HarmonicMean GeometricMean Median Min
## 823.1962536 790.6439707 800.8401862 782.8023396 866.2328633
## Max Random WorstResidual
## 2425.9092020 1105.6531487 832.6074777
## [1] "SVM(2) for store 15"
## Mean HarmonicMean GeometricMean Median Min
## 751.6079078 746.0290621 748.1944698 735.4663313 892.8401471
## Max Random WorstResidual
## 1288.0163486 856.0561395 851.7882099
## [1] "SVM(3) for store 15"
## Mean HarmonicMean GeometricMean Median Min
## 762.4185249 745.8740980 750.4112548 741.9594954 938.8330296
## Max Random WorstResidual
## 1941.3534453 1064.2254154 813.3035391
## [1] "RF for store 15"
## Mean HarmonicMean GeometricMean Median Min
## 6866.798896 6866.798896 6866.798896 6866.798896 6866.798896
## Max Random WorstResidual
## 6866.798896 6866.798896 6866.798896
## [1] "SVM(1) for store 16"
## Mean HarmonicMean GeometricMean Median Min
## 736.0852472 719.1100320 725.1983297 702.2166189 776.8328010
## Max Random WorstResidual
## 1859.1902332 931.3263433 751.9494886
## [1] "SVM(2) for store 16"
## Mean HarmonicMean GeometricMean Median Min
## 687.8638801 671.5744547 677.3507078 663.0925373 772.7677931
## Max Random WorstResidual
## 1835.3372596 845.8755141 765.3886367
## [1] "SVM(3) for store 16"
## Mean HarmonicMean GeometricMean Median Min
## 694.7667917 670.1098699 677.6090695 656.4446567 804.4070440
## Max Random WorstResidual
## 2157.7477967 969.2154324 1709.1294958
## [1] "RF for store 16"
## Mean HarmonicMean GeometricMean Median Min
## 7912.951995 7912.951995 7912.951995 7912.951995 7912.951995
## Max Random WorstResidual
## 7912.951995 7912.951995 7912.951995
## [1] "SVM(1) for store 19"
## Mean HarmonicMean GeometricMean Median Min
## 724.1964028 713.1630456 717.5309387 708.3380552 771.1089626
## Max Random WorstResidual
## 1367.8701759 784.3305282 752.8509324
## [1] "SVM(2) for store 19"
## Mean HarmonicMean GeometricMean Median Min
## 703.1996247 681.3446891 687.6399956 668.9656044 803.1113181
## Max Random WorstResidual
## 2116.3241258 960.1600227 715.1356376
## [1] "SVM(3) for store 19"
## Mean HarmonicMean GeometricMean Median Min
## 706.1614159 675.0823639 684.2140466 664.9845195 774.1792958
## Max Random WorstResidual
## 2269.5609221 980.4870871 729.1072629
## [1] "RF for store 19"
## Mean HarmonicMean GeometricMean Median Min
## 6720.942768 6720.942768 6720.942768 6720.942768 6720.942768
## Max Random WorstResidual
## 6720.942768 6720.942768 6720.942768
## [1] "SVM(1) for store 20"
## Mean HarmonicMean GeometricMean Median Min
## 754.2735776 753.4836821 753.5391313 758.8901111 883.5699669
## Max Random WorstResidual
## 1050.6965095 843.5227834 814.7508368
## [1] "SVM(2) for store 20"
## Mean HarmonicMean GeometricMean Median Min
## 754.6761114 753.3091153 753.8403900 765.6914858 902.8926035
## Max Random WorstResidual
## 1076.3122725 809.4277620 830.9759715
## [1] "SVM(3) for store 20"
## Mean HarmonicMean GeometricMean Median Min
## 763.6950493 766.7143385 764.7837020 767.5652662 1008.1508108
## Max Random WorstResidual
## 1052.1450029 833.1155438 903.2547734
## [1] "RF for store 20"
## Mean HarmonicMean GeometricMean Median Min
## 7939.473525 7939.473525 7939.473525 7939.473525 7939.473525
## Max Random WorstResidual
## 7939.473525 7939.473525 7939.473525
## [1] "SVM(1) for store 21"
## Mean HarmonicMean GeometricMean Median Min
## 739.4236181 735.1079984 736.3933183 731.2712034 842.6407425
## Max Random WorstResidual
## 1192.9262575 829.2542327 777.2229044
## [1] "SVM(2) for store 21"
## Mean HarmonicMean GeometricMean Median Min
## 725.3785723 722.8530738 723.6889388 718.2075379 832.5824423
## Max Random WorstResidual
## 1100.4462701 820.3624635 795.4384827
## [1] "SVM(3) for store 21"
## Mean HarmonicMean GeometricMean Median Min
## 719.7755747 723.5715856 720.8928068 721.8340240 1017.1181317
## Max Random WorstResidual
## 904.7028094 763.4841907 766.6314524
## [1] "RF for store 21"
## Mean HarmonicMean GeometricMean Median Min
## 5766.073943 5766.073943 5766.073943 5766.073943 5766.073943
## Max Random WorstResidual
## 5766.073943 5766.073943 5766.073943
spcombined <- spsvm[2]                                         # the Sales column
spcombined <- cbind(spcombined,spsvm[seq(3,ncol(spsvm),3)])    # 9 predictions from the SVM with C=30
spcombined <- cbind(spcombined,spsvm2[seq(2,ncol(spsvm2),3)])  # 9 predictions from the SVM with C=100
spcombined <- cbind(spcombined,spsvm3[seq(2,ncol(spsvm3),3)])  # 9 predictions from the SVM with C=1000
spcombined <- cbind(spcombined,sprf[seq(2,ncol(sprf),3)])      # 9 predictions from the random forests
spcombined$Mean <- apply(spcombined[,2:37],1,mean)             # columns 2:37 hold all 36 predictions
spcombined$HarmonicMean <- apply(spcombined[,2:37],1,function(x) {1/mean(1/x)})
spcombined$Median <- apply(spcombined[,2:37],1,median)
spcombined$SVM1Median <- apply(spcombined[,2:10],1,median)
spcombined$SVM2Median <- apply(spcombined[,11:19],1,median)
spcombined$SVM3Median <- apply(spcombined[,20:28],1,median)
spcombined$RFMedian <- apply(spcombined[,29:37],1,median)
spcombined$Median2_4 <- apply(spcombined[,11:37],1,median)     # SVM C=100, C=1000 and RF predictions together
spcombined$Mean2_4 <- apply(spcombined[,11:37],1,mean)
spcombined$Mean3_4 <- apply(spcombined[,20:37],1,mean)         # SVM C=1000 and RF predictions together
spcombined$Store <- spsvm[1]
spcombined$BestIndex <- apply(spcombined[,48],1,function(x){best[x[1]]})              # the best model per store, as determined above
spcombined$BestModelMedian <- apply(spcombined[,c(41:44,49)],1,function(x){x[x[5]]})  # the median prediction of that best model
errors <- rep(0,11)
errors[1] <- sqrt(mean((spcombined$Sales-spcombined$Mean)^2))
errors[2] <- sqrt(mean((spcombined$Sales-spcombined$HarmonicMean)^2))
errors[3] <- sqrt(mean((spcombined$Sales-spcombined$Median)^2))
errors[4] <- sqrt(mean((spcombined$Sales-spcombined$SVM1Median)^2))
errors[5] <- sqrt(mean((spcombined$Sales-spcombined$SVM2Median)^2))
errors[6] <- sqrt(mean((spcombined$Sales-spcombined$SVM3Median)^2))
errors[7] <- sqrt(mean((spcombined$Sales-spcombined$RFMedian)^2))
errors[8] <- sqrt(mean((spcombined$Sales-spcombined$Median2_4)^2))
errors[9] <- sqrt(mean((spcombined$Sales-spcombined$Mean2_4)^2))
errors[10] <- sqrt(mean((spcombined$Sales-spcombined$Mean3_4)^2))
errors[11] <- sqrt(mean((spcombined$Sales-spcombined$BestModelMedian)^2))
names(errors) = c("Mean","HarmonicMean","Median","SVM1Median","SVM2Median","SVM3Median","RFMedian","Median2_4","Mean2_4","Mean3_4","BestModelMedian")
errors
## Mean HarmonicMean Median SVM1Median
## 1849.4698715 6910.6134247 754.5727477 773.8190255
## SVM2Median SVM3Median RFMedian Median2_4
## 739.6420578 734.3044560 6912.1748825 756.6036705
## Mean2_4 Mean3_4 BestModelMedian
## 2380.5779594 3502.0374637 731.3195810
The results for the first 15 stores are quite clear: SVMs work better than random forests for all shops. If we take the median of the predictions generated by support vector machines using any settings, we get a more accurate prediction overall than by taking the median of random-forest-based predictions. The improvement is rather substantial.
What is surprising, however, is that despite the sometimes far worse accuracy of the random forest predictions, taking any kind of mean of all 36 predictions generated with the four methods (i.e. the “Mean”, “HarmonicMean” and “Median” figures) actually yields better results than the most accurate SVM-based predictions (mostly the SVM3Median value, i.e. the error of the median prediction generated by SVMs at C=1000). In fact, for the first 15 shops we find that combining the predictions generated by the C=100 and C=1000 SVMs together with the random forests (the “Median2_4” and “Mean2_4” results) yields the lowest cross-validation error.
Finally, it is worth noting that always choosing the median of the predictions generated by the model with the lowest RMSE, i.e. the most accurate type of model, is still slightly less accurate than choosing the mean or median of all predictions.
The logical next step is to repeat the same process on the whole data set (all stores) to see whether these findings can be confirmed. Generating this many predictions for the training set takes rather long (over 10 hours for both the SVMs and the random forests), so I did not run the code inside the R Markdown document but separately, using the lines below. The results confirmed the above findings.
sp <- combineRFForStore(1,9)
for (x in teststores[-1]) {
  sp <- rbind(sp, combineRFForStore(x,9))
}
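The analogous SVM passes (also run outside this document) follow the same pattern; the lines below are a sketch for one cost setting, using a hypothetical spsvmAll accumulator.
spsvmAll <- combineSVMForStore(1,9,100)
for (x in teststores[-1]) {
  spsvmAll <- rbind(spsvmAll, combineSVMForStore(x,9,100))
}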
I conclude from the outcomes of these experiments on subsets of the whole training data that it is consistently very useful to generate predictions using several different models (SVMs using different parameters, with the data transformed by PCA, as well as random forests) and combine them by taking the mean or median of all predictions. This led to results that were clearly superior to generating single predictions using any of the methods, and even to generating several predictions, all using the same method.
I generated predictions for the test data in the Kaggle competition based on what I have learned about the data above.
First I add the additional variables, as I did for the training table.
ttest$DateYear <- as.numeric(strftime(ttest$Date, format="%y"))
ttest$DateMonth <- as.numeric(strftime(ttest$Date, format="%m"))
ttest$DateDay <- as.numeric(strftime(ttest$Date, format="%d"))
ttest$DateWeek <- as.numeric(strftime(ttest$Date, format="%W"))
ttest <- ttest[c(1,4:11,2)]
ttest$Competition <- F
ttest$Promo2 <- F
for (i in teststores) {
  if (!is.na(s[i,5])) {
    m <- s[i,5]
    y <- s[i,6] - 2000
    ttest[ttest$Store == i,]$Competition <- ttest[ttest$Store == i,]$DateYear > y | (ttest[ttest$Store == i,]$DateMonth >= m & ttest[ttest$Store == i,]$DateYear == y)
  }
  if (s[i,7]) {
    w <- s[i,8]
    y <- s[i,9] - 2000
    ttest[ttest$Store == i,]$Promo2 <- ttest[ttest$Store == i,]$DateYear > y | (ttest[ttest$Store == i,]$DateWeek >= w & ttest[ttest$Store == i,]$DateYear == y)
  }
}
This table will be used for the random forest predictions. I also create its binary version for the SVM predictions. The difficulty here is that this cannot be accomplished simply by transforming the numeric variables present in the test set into binary variables, since for some variables fewer values are present in the test table than in the training table. For example, the Year variable had three possible values in the training table (13, 14, 15) but only one value (15) in the test table. Automatic transformation would therefore produce two variables in the training table (14TRUE, 15TRUE) but no variable at all in the test table (since 15 is constant there). This is a problem because the models are fitted to the training data, and predictions with these models only work if the very same variables are present in the test data as in the training data.
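A minimal illustration of the problem with hypothetical year values (not part of the actual pipeline):
trainYear <- data.frame(DateYear = factor(c(13, 14, 15)))
testYear <- data.frame(DateYear = factor(15))
colnames(model.matrix(~ DateYear, trainYear))                   # "(Intercept)" "DateYear14" "DateYear15"
# model.matrix(~ DateYear, testYear)                            # fails: contrasts need at least two levels
colnames(model.matrix(~ DateYear, rbind(trainYear, testYear)))  # same columns as the training table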
The solution is to combine the rows of the training and the test table before carrying out the binary transformation, and then extract the rows with the test data. This is achieved with the following code.
ttemp1 <- t[3:12]
ttemp2 <- ttest[3:12]
lastTrainRow <- nrow(ttemp1)
ttemp <- rbind(ttemp1,ttemp2)
ttemp <- cbind(rep(1,nrow(ttemp)),ttemp)
names(ttemp)[1] <- "Dummy"
tbtemp <- model.matrix(Dummy ~ Promo + StateHoliday + SchoolHoliday + DayOfWeek + as.factor(DateYear) + as.factor(DateMonth) + as.factor(DateDay) + as.factor(DateWeek) + Competition + Promo2, data = ttemp)
tbtest <- cbind(ttest[1:2],tbtemp[(lastTrainRow+1):nrow(tbtemp),2:ncol(tbtemp)])
I also transform the data type for the factor and logical variables in the numeric tables to numeric.
tn <- t
tn$Promo <- as.numeric(tn$Promo)
tn$StateHoliday <- as.numeric(tn$StateHoliday)
tn$SchoolHoliday <- as.numeric(tn$SchoolHoliday)
tn$DayOfWeek <- as.numeric(tn$DayOfWeek)
tn$Competition<- as.numeric(tn$Competition)
tn$Promo2 <- as.numeric(tn$Promo2)
tntest <- ttest
tntest$Promo <- as.numeric(tntest$Promo)
tntest$StateHoliday <- as.numeric(tntest$StateHoliday)
tntest$SchoolHoliday <- as.numeric(tntest$SchoolHoliday)
tntest$DayOfWeek <- as.numeric(tntest$DayOfWeek)
tntest$Competition<- as.numeric(tntest$Competition)
tntest$Promo2 <- as.numeric(tntest$Promo2)
Before generating the actual predictions, I create a new Prediction variable in the ttest table, and set it to 0 everywhere.
ttest$Prediction <- 0
A further complication is that the procedure which produced the best predictions above cannot be carried over directly to models fitted on the whole data set. Recall that the best results were achieved when at least 6 different SVM-based predictions were generated for each open day of each shop and the median of these predictions was taken. This is easy to reproduce with random forests by fitting the required number of forests to the whole data set, but it does not work with SVMs, whose implementation is completely deterministic: if SVMs are fitted to the same training set and predictions are generated for the same test set several times, the very same predictions are produced every time.
One option would be to generate a single SVM prediction, but I chose a different route and proceed as during cross-validation: I divide the whole training set into ten equal parts and train 10 separate SVMs, each on a different 90% subset of the data. This yields ten separate predictions, each based on a slightly different data set. I repeat this procedure for C = 100 and C = 1000.
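The idea for a single cost value can be sketched as a small helper (hypothetical; the full loop below inlines the same logic for both cost values):
tenSvmPredictions <- function(pctrain, pctest, cst) {
  z <- nrow(pctrain)
  shuffledIndices <- sample(z)
  # one SVM per 90% subset, each contributing one column of predictions for the test rows
  sapply(1:10, function(k) {
    sampleIndex <- floor(1+0.1*(k-1)*z):(0.1*k*z)
    modell <- svm(Sales ~ ., pctrain[shuffledIndices[-sampleIndex],], cost=cst, gamma=0.001)
    as.vector(predict(modell, pctest))
  })
}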
To summarize, the following procedure is followed for every store: generate 9 random forest predictions on the numeric data, then 10 SVM predictions at C = 100 and 10 at C = 1000 on the PCA-transformed binary data (each SVM trained on a different 90% subset), and finally take the median of these 29 predictions for every open day.
print(system.time(for (storeNumber in teststores) {
  print(storeNumber)
  storen <- tn[tn$Store == storeNumber,2:ncol(tn)]
  storentest <- tntest[tntest$Store == storeNumber & tntest$Open,3:ncol(tntest)]
  storeb <- tb[tb$Store == storeNumber,2:ncol(tb)]
  storeb <- storeb[sample(nrow(storeb)),]
  storebtest <- tbtest[tbtest$Store == storeNumber & tbtest$Open,3:ncol(tbtest)]
  trank <- qr(storeb[3:ncol(storeb)])$rank
  pc <- prcomp(storeb[3:ncol(storeb)])
  pctrain <- cbind(storeb[1],pc$x)[1:min(trank-1,100)]
  pctest <- as.data.frame(predict(pc,storebtest))[1:min(trank,100)]
  predictions <- matrix(rep(0, nrow(storentest) * 29), nrow = nrow(storentest), ncol = 29)
  # 9 random forest predictions based on the numeric data
  for (c in 1:9) {
    modell <- randomForest(Sales ~ .,storen, ntree=200)
    predictions[,c] <- as.vector(predict(modell,storentest))
  }
  # 10 SVM predictions with C = 100, each trained on a different 90% subset of the PCA-transformed data
  z <- nrow(storen)
  shuffledIndices <- sample(z)
  for (c in 1:10) {
    sampleIndex <- floor(1+0.1*(c-1)*z):(0.1*c*z)
    trainParts <- pctrain[shuffledIndices[-sampleIndex],]
    modell <- svm(Sales ~ ., trainParts, cost=100,gamma=0.001)
    predictions[,c+9] <- as.vector(predict(modell,pctest))
  }
  # 10 SVM predictions with C = 1000
  shuffledIndices <- sample(z)
  for (c in 1:10) {
    sampleIndex <- floor(1+0.1*(c-1)*z):(0.1*c*z)
    trainParts <- pctrain[shuffledIndices[-sampleIndex],]
    modell <- svm(Sales ~ ., trainParts, cost=1000,gamma=0.001)
    predictions[,c+19] <- as.vector(predict(modell,pctest))
  }
  # the final prediction for each open day is the median of all 29 predictions
  ttest[ttest$Store == storeNumber & ttest$Open,]$Prediction <- apply(predictions,1,median)
}))
Finally the prediction is saved in a simple two-column CSV file as required.
submission <- cbind(1:nrow(ttest),ttest$Prediction)
write.csv(submission,file="submission.csv",row.names=FALSE) # write.csv always uses "," as the separator, so no sep argument is needed
The result was submitted to the Kaggle competition system for evaluation. Unfortunately the true accuracy of the predictions turned out to be far from satisfactory: the code above only came in 2350th out of 3300 entries. At a root mean squared percentage error of about 0.154 it was certainly better than the baseline set by Rossmann (the median of the sales for the given day of week and shop, with an RMSPE of about 0.24), but still very far from the winner's score of just 0.1. Surprisingly, another submission of mine that used a less sophisticated version of the above procedure, working with just one SVM at C=100, one at C=1000 and the median of 9 random forest predictions, and taking the mean of these three figures, fared noticeably better and reached position 2170 with a score of about 0.147. That is quite a bit better, but still clearly not worth the amount of time and energy invested, and it is not clear to me how the procedure could be improved further.
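For reference, the competition metric (root mean squared percentage error, with zero-sales days ignored) can be computed locally with a small helper such as the sketch below, which may be useful for evaluating future variants against held-out data.
rmspe <- function(actual, predicted) {
  keep <- actual != 0   # days with zero sales are ignored in scoring
  sqrt(mean(((actual[keep] - predicted[keep]) / actual[keep])^2))
}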