This paper presents a solution to the prediction task that Rossmann published as a competition on Kaggle in September 2015. The task was to forecast sales for 1115 Rossmann stores in August and September 2015 based on historical data from January 2013 to July 2015.
The task can be accessed at the following URL: Rossmann Store Sales
The data to be used to solve the task were published in three CSV files on Kaggle’s site. The files are the following:
train.csv: Contains the training data on which the predictions must be based. The variables included in this file are as follows:
t <- read.csv('F:/prog/RStudio/rossmann/train.csv',colClasses=c("factor","factor","Date","integer","integer","factor","factor","factor","factor"))
t[,6] <- t[,6] == 1
t[,7] <- t[,7] == 1
t[,9] <- t[,9] == 1
summary(t)
## Store DayOfWeek Date Sales
## 1 : 942 1:144730 Min. :2013-01-01 Min. : 0
## 10 : 942 2:145664 1st Qu.:2013-08-17 1st Qu.: 3727
## 1001 : 942 3:145665 Median :2014-04-02 Median : 5744
## 1002 : 942 4:145845 Mean :2014-04-11 Mean : 5774
## 1003 : 942 5:145845 3rd Qu.:2014-12-12 3rd Qu.: 7856
## 1005 : 942 6:144730 Max. :2015-07-31 Max. :41551
## (Other):1011557 7:144730
## Customers Open Promo StateHoliday
## Min. : 0.0 Mode :logical Mode :logical 0:986159
## 1st Qu.: 405.0 FALSE:172817 FALSE:629129 a: 20260
## Median : 609.0 TRUE :844392 TRUE :388080 b: 6690
## Mean : 633.1 NA's :0 NA's :0 c: 4100
## 3rd Qu.: 837.0
## Max. :7388.0
##
## SchoolHoliday
## Mode :logical
## FALSE:835488
## TRUE :181721
## NA's :0
##
##
##
The variables are explained briefly on the competition site, so I will not repeat the descriptions here. It is worth noting that the data contain a separate DayOfWeek variable, which is redundant since this information can easily be extracted from the Date variable. In the code above I converted the 1 and 0 values provided for the boolean variables in the respective columns of the data file into logical TRUE and FALSE values. Although this is not immediately apparent from the documentation, inspecting the data shows that the SchoolHoliday variable is TRUE for the days of the Easter, summer, autumn and Christmas school holidays but FALSE for days like Sundays or state holidays, when schools are also closed.
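As a quick sanity check (a sketch, not part of the original preprocessing; it assumes that DayOfWeek follows the convention Monday = 1 through Sunday = 7, which the data appear to use), the redundancy of DayOfWeek can be verified directly:
wd <- as.POSIXlt(t$Date)$wday # weekday as 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wd[wd == 0] <- 7 # shift Sunday to 7 so that Monday = 1 ... Sunday = 7
all(as.numeric(as.character(t$DayOfWeek)) == wd) # TRUE if DayOfWeek matches the weekday derived from Date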
store.csv: Provides information on the Rossmann stores, especially whether the store is participating in a continuing promotion programme and the distance of the nearest competition store. The type and assortment size of the store are also characterized.
s <- read.csv('F:/prog/RStudio/rossmann/store.csv',colClasses=c("factor","factor","factor","integer","integer","integer","factor","integer","integer","character"))
s[,7] <- s[,7] == 1
summary(s)
## Store StoreType Assortment CompetitionDistance
## 1 : 1 a:602 a:593 Min. : 20.0
## 10 : 1 b: 17 b: 9 1st Qu.: 717.5
## 100 : 1 c:148 c:513 Median : 2325.0
## 1000 : 1 d:348 Mean : 5404.9
## 1001 : 1 3rd Qu.: 6882.5
## 1002 : 1 Max. :75860.0
## (Other):1109 NA's :3
## CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2
## Min. : 1.000 Min. :1900 Mode :logical
## 1st Qu.: 4.000 1st Qu.:2006 FALSE:544
## Median : 8.000 Median :2010 TRUE :571
## Mean : 7.225 Mean :2009 NA's :0
## 3rd Qu.:10.000 3rd Qu.:2013
## Max. :12.000 Max. :2015
## NA's :354 NA's :354
## Promo2SinceWeek Promo2SinceYear PromoInterval
## Min. : 1.0 Min. :2009 Length:1115
## 1st Qu.:13.0 1st Qu.:2011 Class :character
## Median :22.0 Median :2012 Mode :character
## Mean :23.6 Mean :2012
## 3rd Qu.:37.0 3rd Qu.:2013
## Max. :50.0 Max. :2015
## NA's :544 NA's :544
s[2,]
## Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth
## 2 2 a a 570 11
## CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear
## 2 2007 TRUE 13 2010
## PromoInterval
## 2 Jan,Apr,Jul,Oct
As the example shows, the PromoInterval variable is presented in the table as a comma-separated string of month name abbreviations. It is not immediately clear to what extent the information in this file is useful for calculating our predictions. Initially I will ignore this table and work with the training file only.
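Should this information be needed later, the string could be split into the individual months, for example (a sketch only, since the table is ignored for now):
strsplit(s$PromoInterval[2], ",")[[1]] # yields "Jan" "Apr" "Jul" "Oct" for the store shown above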
test.csv: This table contains the same columns as train.csv, but the sales and the number of customers are unknown. The task is to predict sales only. Note that the “test” table cannot be used for testing purposes directly since it does not include any relevant data. It is used only for generating the predictions, which are then submitted to Kaggle’s evaluation system, which in turn assesses them using a root mean squared percentage error (RMSPE) metric (a small helper implementing this metric is sketched after this overview).
ttest <- read.csv('F:/prog/RStudio/rossmann/test.csv',colClasses=c("numeric","factor","factor","Date","integer","integer","factor","factor"))
ttest[,5] <- ttest[,5] == 1
ttest[,6] <- ttest[,6] == 1
ttest[,8] <- ttest[,8] == 1
summary(ttest)
## Id Store DayOfWeek Date
## Min. : 1 1 : 48 1:5992 Min. :2015-08-01
## 1st Qu.:10273 10 : 48 2:5992 1st Qu.:2015-08-12
## Median :20545 100 : 48 3:5992 Median :2015-08-24
## Mean :20545 1000 : 48 4:5992 Mean :2015-08-24
## 3rd Qu.:30816 1003 : 48 5:5136 3rd Qu.:2015-09-05
## Max. :41088 1004 : 48 6:5992 Max. :2015-09-17
## (Other):40800 7:5992
## Open Promo StateHoliday SchoolHoliday
## Mode :logical Mode :logical 0:40908 Mode :logical
## FALSE:5984 FALSE:24824 a: 180 FALSE:22866
## TRUE :35093 TRUE :16264 TRUE :18222
## NA's :11 NA's :0 NA's :0
##
##
##
head(ttest)
## Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
## 1 1 1 4 2015-09-17 TRUE TRUE 0 FALSE
## 2 2 3 4 2015-09-17 TRUE TRUE 0 FALSE
## 3 3 7 4 2015-09-17 TRUE TRUE 0 FALSE
## 4 4 8 4 2015-09-17 TRUE TRUE 0 FALSE
## 5 5 9 4 2015-09-17 TRUE TRUE 0 FALSE
## 6 6 10 4 2015-09-17 TRUE TRUE 0 FALSE
ttest <- ttest[2:8]
The first column is new: it is simply a row ID, which will not be needed, since R automatically assigns row names to data frames, and these are retained when the rows of the table are reordered.
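For reference, the RMSPE metric mentioned above can be written as a small helper function for local experimentation (a sketch; according to the competition description, days with zero sales are ignored in the scoring):
rmspe <- function(actual, predicted) {
keep <- actual != 0 # days with zero sales are excluded from scoring
sqrt(mean(((actual[keep] - predicted[keep]) / actual[keep])^2))
}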
First of all, it should be noted that the Customers variable in the data set serves no purpose. On the one hand, it is not available as an input variable in the test set (which is unfortunate, since the number of customers is presumably the best predictor of a store's sales), so fitting a model to it would be useless from the perspective of the task. On the other hand, the task is not to predict the number of customers, so fitting a model to predict them from the variables given in the test set makes no sense either. Therefore this variable can be dropped from the training set.
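Before the column is dropped, one could quickly check how strongly the number of customers is associated with sales (shown for illustration only; it plays no role in the models below):
cor(t$Customers, t$Sales) # correlation between daily customers and daily sales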
As can be expected, sales of closed shops are zero:
summary(t$Sales[!t$Open])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
Therefore, predicting sales for closed shops is trivial, so I remove all data for closed shops from the data set, which reduces its size by about 17 per cent. The Open column also becomes superfluous.
length(t$Sales[!t$Open]) / nrow(t)
## [1] 0.1698933
t <- t[t$Open,]
Since the usual machine learning algorithms cannot handle dates as an input variable directly, I extract a number of numeric variables from this attribute: year, month, day and week number. The year, month and day variables are obvious. Week number might be useful, as it is coarser than the day but finer than the month variable, and will be necessary if I want to take into consideration the Promo2 type of promotion later (since the starting week of the store’s participation is given in week number).
Although all of these are strictly speaking factor variables, I will store them as numeric values for now and adjust their data type later.
t$DateYear <- as.numeric(strftime(t$Date, format="%y")) # the strftime function extracts a string from the date, so this must be transformed to a numeric value
t$DateMonth <- as.numeric(strftime(t$Date, format="%m"))
t$DateDay <- as.numeric(strftime(t$Date, format="%d"))
t$DateWeek <- as.numeric(strftime(t$Date, format="%W"))
t <- t[c(1,4,7:13,2)] # After having extracted the components of the date, I am dropping the "Customers" and "Date" variables and moving the "DayOfWeek" variable into the last column since this way it is grouped together with the other date components
head(t)
## Store Sales Promo StateHoliday SchoolHoliday DateYear DateMonth DateDay
## 1 1 5263 TRUE 0 TRUE 15 7 31
## 2 2 6064 TRUE 0 TRUE 15 7 31
## 3 3 8314 TRUE 0 TRUE 15 7 31
## 4 4 13995 TRUE 0 TRUE 15 7 31
## 5 5 4822 TRUE 0 TRUE 15 7 31
## 6 6 5651 TRUE 0 TRUE 15 7 31
## DateWeek DayOfWeek
## 1 30 5
## 2 30 5
## 3 30 5
## 4 30 5
## 5 30 5
## 6 30 5
As we saw at the end of the first section, not all shops appear in the test data. The test set covers 48 days for each store but contains only 41088 rows in total, which means that we only need to generate models for 856 stores instead of the total of 1115.
41088 / 48
## [1] 856
Since I will be fitting models for each shop separately in order to predict sales in the test period, data for shops that are absent from the test set serve no purpose. They will not be used to fit models anyway. Therefore, I collect the store IDs from the test set in a separate variable.
teststores <- as.numeric(as.character(unique(ttest$Store)))
teststores[1:10]
## [1] 1 3 7 8 9 10 11 12 13 14
This variable is used to filter out the data for the unused stores.
nrow(t)
## [1] 844392
t <- t[t$Store %in% teststores,]
nrow(t)
## [1] 641985
This reduces the data to be processed by almost 25 per cent.
It is also apparent from the summary of the test data that there are some NA values in the Open variable, which must be filled in for the predictions to be generated properly:
ttest[is.na(ttest$Open),]
## Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
## 480 622 4 2015-09-17 NA TRUE 0 FALSE
## 1336 622 3 2015-09-16 NA TRUE 0 FALSE
## 2192 622 2 2015-09-15 NA TRUE 0 FALSE
## 3048 622 1 2015-09-14 NA TRUE 0 FALSE
## 4760 622 6 2015-09-12 NA FALSE 0 FALSE
## 5616 622 5 2015-09-11 NA FALSE 0 FALSE
## 6472 622 4 2015-09-10 NA FALSE 0 FALSE
## 7328 622 3 2015-09-09 NA FALSE 0 FALSE
## 8184 622 2 2015-09-08 NA FALSE 0 FALSE
## 9040 622 1 2015-09-07 NA FALSE 0 FALSE
## 10752 622 6 2015-09-05 NA FALSE 0 FALSE
We see that all NA values pertain to a single store (622), they cover consecutive days except the Sundays in this period, and they include all other days of the week (Monday to Saturday). Furthermore, there are four days at the end of the period when an active promotion is running in this store. Since the intervening Sundays are explicitly marked as closed (Open is FALSE), we can assume that the NA days are all open days. Therefore I set the NA values to TRUE.
ttest[ttest$Store == 622 & ttest$Date > "2015-09-05" & ttest$DayOfWeek == 7,]
## Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
## 3904 622 7 2015-09-13 FALSE FALSE 0 FALSE
## 9896 622 7 2015-09-06 FALSE FALSE 0 FALSE
ttest[is.na(ttest$Open),]$Open <- T
I examine the data according to the usual descriptive statistics.
summary(t$Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 4848 6299 6855 8198 41550
sd(t$Sales)
## [1] 2953.504
hist(t$Sales,xlab="Sales Amount")
boxplot(t$Sales)
Sales above 20000 appear to be outliers in the data set, so these should be checked in order to ascertain that the data are not corrupted.
summary(t[t$Sales > 20000,])
## Store Sales Promo StateHoliday
## 1114 :429 Min. :20002 Mode :logical 0:2464
## 262 :407 1st Qu.:20747 FALSE:713 a: 23
## 251 :263 Median :21868 TRUE :1788 b: 10
## 562 :221 Mean :22902 NA's :0 c: 4
## 842 :188 3rd Qu.:23832
## 383 :123 Max. :41551
## (Other):870
## SchoolHoliday DateYear DateMonth DateDay
## Mode :logical Min. :13.00 Min. : 1.000 Min. : 1.00
## FALSE:1971 1st Qu.:13.00 1st Qu.: 4.000 1st Qu.: 5.00
## TRUE :530 Median :14.00 Median : 6.000 Median :16.00
## NA's :0 Mean :13.84 Mean : 6.904 Mean :15.05
## 3rd Qu.:14.00 3rd Qu.:11.000 3rd Qu.:23.00
## Max. :15.00 Max. :12.000 Max. :31.00
##
## DateWeek DayOfWeek
## Min. : 0.0 1:926
## 1st Qu.:15.0 2:367
## Median :26.0 3:231
## Mean :27.5 4:216
## 3rd Qu.:44.0 5:336
## Max. :52.0 6:255
## 7:170
It appears that certain stores (1114, 262, etc.) simply produce very high sales consistently, which means that these sales figures are very likely correct rather than data errors. It is also worth noting that the exceedingly high sales are spread across all kinds of dates (and over all days of the week, Monday being the best day for sales by far), school holiday or not, promo or not, etc. We can conclude that the data are probably dependable and of high quality.
One aspect that must be checked is whether sufficient historical data are available for all stores. For this reason I count the number of rows for each store.
openDays <- aggregate(t$Store,list(t$Store),length)
head(openDays)
## Group.1 x
## 1 1 781
## 2 10 784
## 3 100 606
## 4 1000 622
## 5 1003 779
## 6 1004 622
We have historical data on about 781 days for store 1, etc. Statistics for all stores show that we have a rather balanced number of open days, with some outliers but very little variation otherwise. Even the store with the smallest number of days (592) was open for more than a year and a half, so there should be enough data to base our predictions on.
summary(openDays)
## Group.1 x
## 1 : 1 Min. :592.0
## 10 : 1 1st Qu.:770.5
## 100 : 1 Median :779.0
## 1000 : 1 Mean :750.0
## 1003 : 1 3rd Qu.:781.0
## 1004 : 1 Max. :942.0
## (Other):850
Next, I investigate how sales are related to the input variables throughout the data set:
t.test(t[t$Promo,]$Sales,t[!t$Promo,]$Sales)
##
## Welch Two Sample t-test
##
## data: t[t$Promo, ]$Sales and t[!t$Promo, ]$Sales
## t = 306.39, df = 553080, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2146.504 2174.143
## sample estimates:
## mean of x mean of y
## 8052.484 5892.160
Not surprisingly, promotions have a very significant effect on sales, which are about 37 per cent higher on average during promotions.
t.test(t[t$StateHoliday != 0,]$Sales,t[t$StateHoliday == 0,]$Sales)
##
## Welch Two Sample t-test
##
## data: t[t$StateHoliday != 0, ]$Sales and t[t$StateHoliday == 0, ]$Sales
## t = 7.9118, df = 553.16, p-value = 1.39e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1844.204 3062.364
## sample estimates:
## mean of x mean of y
## 9305.717 6852.433
t.test(t[t$SchoolHoliday,]$Sales,t[!t$SchoolHoliday,]$Sales)
##
## Welch Two Sample t-test
##
## data: t[t$SchoolHoliday, ]$Sales and t[!t$SchoolHoliday, ]$Sales
## t = 27.179, df = 179800, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 242.7369 280.4673
## sample estimates:
## mean of x mean of y
## 7066.096 6804.493
State holidays of all types have a very large effect on sales as well, whereas the impact of school holidays is relatively negligible, although statistically still highly significant.
tapply(t$Sales,t$DateYear,mean)
## 13 14 15
## 6695.345 6930.516 7012.287
Sales are growing each year, which is probably due to the fact that Rossmann’s prices are being raised on average little by little (even though there is essentially no inflation in Germany).
tapply(t$Sales,t$DateMonth,mean)
## 1 2 3 4 5 6 7 8
## 6460.397 6494.003 6866.441 6938.716 7023.093 6926.269 6874.591 6586.886
## 9 10 11 12
## 6436.536 6476.993 7076.840 8447.998
plot(tapply(t$Sales,t$DateMonth,mean),xlab="Month",ylab="Mean of Sales")
As expected, the sales are highest in December, during the holiday season, whereas they vary comparatively little during the other months.
tapply(t$Sales,t$DayOfWeek,mean)
## 1 2 3 4 5 6 7
## 8006.594 6907.936 6593.053 6609.182 6973.977 6040.803 8030.835
Sales are highest on Mondays and Sundays and at a very even level from Tuesday to Friday. Sales are lowest on Saturdays, which might have to do with the fact that some stores close earlier on Saturday. Stores in Germany are generally closed on Sundays and only open on a few days a year, presumably at times when very good sales can be expected, which probably explains the high sales seen on Sundays.
plot(tapply(t$Sales,t$DateDay,mean),xlab="Day of Month",ylab="Mean of Sales")
Most people evidently get their salaries at the end of the month, around the 29th or 30th, sometimes a few days earlier. As expected, sales (and spending) are lowest during the days prior to the salaries being paid. Surprisingly, there is also a smaller hump around the middle of each month. Altogether, the day of the month has a very significant effect on sales.
plot(tapply(t$Sales,t$DateWeek,mean),xlab="Week Number",ylab="Mean of Sales")
At first sight the plot for the week numbers looks extremely noisy. However, what we are seeing here is in fact the rather strong day-of-month-related fluctuation seen directly above, superposed on the relatively flat month-by-month variation. Nevertheless, some noise is surely present because the range of days covered by each week number is different each year. For example, the fifth week of 2013 was Jan 28 to Feb 3, whereas in 2015 it was Jan 26 to Feb 1. These date ranges are associated with different average sales, which appears as random noise in the above plot.
The question remains whether the weekly data, despite looking noisy, are useful for predicting sales. To answer this question, I plot the mean sales by week number for each individual year in the following figure, with the years color-coded:
plot(tapply(t[t$DateYear == 13,]$Sales,t[t$DateYear == 13,]$DateWeek,mean),xlab="Week Number",ylab="Mean of Sales")
points(tapply(t[t$DateYear == 14,]$Sales,t[t$DateYear == 14,]$DateWeek,mean),col="red")
points(tapply(t[t$DateYear == 15,]$Sales,t[t$DateYear == 15,]$DateWeek,mean),col="blue")
legend("top",c("2013","2014","2015"),col=c("black","red","blue"),pch=c(1,1,1))
The results are somewhat surprising. Despite the very noisy appearance of the earlier week number plot, the sales are in fact rather consistent per week number, i.e. the black, blue and red circles are very close to each other for most weeks, e.g. weeks 8 to 11 or 27. The only weeks where we see really large differences between the years are weeks 14 to 17, but these differences are due to the fact that Easter is celebrated in a different week each year. In 2013 (black) it was week 14, in 2014 (red) week 17 and in 2015 (blue) week 15, which are exactly the three conspicuous outlier circles visible in the plot. Thus we can conclude that, surprisingly, week numbers are presumably a rather good predictor variable for sales. It must be kept in mind, however, that the relationship between the week number and the sales is not linear. Unfortunately, for the weeks that must be predicted (from week 32 onward) the gap between the 2013 and 2014 data is often very wide (see especially weeks 32 to 40) compared to the weeks in the first half of the year. Since the task is to predict sales for weeks 31 to 38, this means that week number will be a less reliable predictor than we would hope for exactly the crucial weeks.
Establishing a baseline prediction for this particular data set is relatively straightforward. We can assume that Sales are zero when the store is closed, and we can set the Sales to the mean sales value of non-closed days when it is not closed. Since I have already removed closed days from the data set, I will calculate the baseline for open days only.
The procedure is illustrated with store no. 1:
store <- t[t$Store == 1,]
mean(store$Sales) # mean sales when open
## [1] 4759.096
store$Prediction <- mean(store$Sales)
store$Error <- store$Prediction - store$Sales
The root mean squared error can then be easily calculated:
sqrt(mean((store$Sales-store$Prediction)^2))
## [1] 1011.458
It is worth checking whether the median (which is an equally simple descriptive statistic) possibly offers a better baseline prediction:
store$Prediction <- median(store$Sales)
sqrt(mean((store$Sales-store$Prediction)^2))
## [1] 1017.651
This result indicates that the median is a marginally worse predictor at least for the first store, so I will continue working with the mean.
The errors are calculated in the way described above for all stores:
errors <- rep(0,length(teststores))
means <- aggregate(t$Sales,list(t$Store),mean)
j <- 1
for ( i in teststores) {
errors[j] <- sqrt(mean((t[t$Store == i,]$Sales - means[means$Group.1 == i,2])^2))
j <- j + 1
}
The errors vector contains the root mean squared errors for each individual store.
summary(errors)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 629.9 1356.0 1711.0 1771.0 2057.0 4760.0
sd(errors)
## [1] 571.9886
hist(errors)
Again the error for some stores is extremely high, so these are worth examining.
manyerrors <- cbind(as.numeric(as.character(teststores[errors > 3500])),errors[errors > 3500])
manyerrors
## [,1] [,2]
## [1,] 234 3504.596
## [2,] 251 3545.364
## [3,] 262 4666.399
## [4,] 335 4759.825
## [5,] 586 3537.216
## [6,] 756 3910.047
## [7,] 831 3615.355
## [8,] 842 3785.939
## [9,] 1014 4568.576
## [10,] 1027 3751.706
It appears that the stores with the highest root mean squared error overlap with those that produced the highest sales (we saw 251, 262 and 842 there in particular). This is reasonable, so the data seem to be correct.
Finally, the RMSE for the whole data set is calculated:
t$Prediction <- apply(t[1],1,function(x){means[means$Group.1 == x,2]})
sqrt(mean((t$Sales-t$Prediction)^2))
## [1] 1872.81
The aim is to reduce the RMSE attained through modeling as far as possible below this amount.
I will first experiment with linear regression models. I train a separate model for each store on the complete set of features and assess the accuracy of the predictions of these models using 10-fold cross-validation on each store. There are about 800 days of data for every store, so the folds contain about 80 records each, i.e. the training sets are 720 records long.
storelm <- function(storeNumber) {
store <- t[t$Store == storeNumber,] # a store is selected
shuffledIndices <- sample(nrow(store)) # the data for the store are shuffled
store$Prediction <- 0
z <- nrow(store)
for (i in 1:10) { # 10-fold cross-validation
sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z) # 10 % of all data rows is selected
test <- store[shuffledIndices[sampleIndex],] # it is used as test set
train <- store[shuffledIndices[-sampleIndex],] # the rest is used as training set
modell <- lm(Sales ~ Promo + SchoolHoliday + DayOfWeek + as.factor(DateYear) + as.factor(DateMonth) + as.factor(DateDay) + as.factor(DateWeek), train) # a linear model is fitted to the training set
store[shuffledIndices[sampleIndex],]$Prediction <- predict(modell,test) # predictions are generated for the test set based on the model
}
print(paste("Prediction for 13/6/1: ", store[store$DateYear == 13 & store$DateMonth == 6 & store$DateDay == 1,]$Prediction))
### For some reason that is not clear to me, the linear models generate very inaccurate predictions (usually sales above 100,000) for June 1, 2013 for all stores. I work around this problem by falling back to a less specific prediction for that date: the mean of all predictions generated for the current store.
store[store$DateYear == 13 & store$DateMonth == 6 & store$DateDay == 1,]$Prediction <- mean(store$Prediction)
print(paste("Training error: ", summary.lm(modell)[6])) # this is the training error of the model that was fitted last
sqrt(mean((store$Sales-store$Prediction)^2))
}
paste("RMSE: ", storelm(1))
## [1] "Prediction for 13/6/1: 67093.021139132"
## [1] "Training error: 543.679324084399"
## [1] "RMSE: 601.900441744124"
The RMSE for the linear model (around 600) is much better than the baseline (around 1000) for the first shop. The improvement is similar for the stores that had a very high error for the baseline predictions. In these cases the error is approximately halved, i.e. it goes down from about 4000 to about 2000.
print(system.time(print(paste("RMSE: ", storelm(1014)))))
## [1] "Prediction for 13/6/1: 161079.096915653"
## [1] "Training error: 1761.91561342843"
## [1] "RMSE: 2154.65616847458"
## user system elapsed
## 0.39 0.00 0.39
The only exception that I found is store 262, where the error is about twice as high as earlier. This will need to be addressed.
print(system.time(print(paste("RMSE: ", storelm(262)))))
## [1] "Prediction for 13/6/1: 22061.2693322737"
## [1] "Training error: 2593.43079952038"
## [1] "RMSE: 9983.24496923603"
## user system elapsed
## 0.37 0.00 0.38
I calculate an RMSE score for each of the stores using the cross-validation function defined above, and then I average the RMSE scores, which is a slightly inaccurate way of calculating the RMSE for the whole data set (since not all stores have exactly the same number of open days), but by far the simplest. Before doing this, however, the original table with the training data must be transformed into one where the input variables are treated as categorical and replaced by separate binary (logical) variables for each level. The latter step is necessary because when we try to run the storelm function as defined above, the predict function for the linear model crashes. This is due to a technical reason: for some stores certain values of the DateWeek variable (especially 52) are so sparse that they do not occur at all in some training folds, and since the lm function reduces the number of factor levels automatically if some levels are not instantiated, the factor variables in the training and test folds do not match on some iterations of the cross-validation process. Therefore I have to transform the variables manually, without reducing their levels.
tb <- t[c(1,2)]
tb <- cbind(tb,model.matrix(Sales ~ Promo + StateHoliday + SchoolHoliday + DayOfWeek + as.factor(DateYear) + as.factor(DateMonth) + as.factor(DateDay) + as.factor(DateWeek), data = t))
storelm2 <- function(storeNumber) {
store <- tb[tb$Store == storeNumber,2:ncol(tb)] # the store number factor variable is removed since it is not needed
shuffledIndices <- sample(nrow(store))
store$Prediction <- 0
z <- nrow(store)
for (i in 1:10) {
sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z)
test <- store[shuffledIndices[sampleIndex],]
train <- store[shuffledIndices[-sampleIndex],1:(ncol(store)-1)]
modell <- lm(Sales ~ ., train) # the linear model uses all variables
store[shuffledIndices[sampleIndex],]$Prediction <- predict(modell,test)
}
store[store$Prediction == max(store$Prediction),]$Prediction <- mean(store$Prediction)
sqrt(mean((store$Sales-store$Prediction)^2))
}
The function is checked to make sure it works correctly. The prediction for store 1 is no worse than with the first version of the function, and strangely enough, it consistently fails to reproduce the anomalously high prediction error for store 262 despite doing essentially the same thing. Execution time is about twice as long but still below one second for each store, which is acceptable.
print(system.time(print(storelm2(1))))
## [1] 612.9252
## user system elapsed
## 0.72 0.11 0.83
print(system.time(print(storelm2(262))))
## [1] 1884.843
## user system elapsed
## 0.78 0.12 0.92
The RMSE scores are calculated for all stores. The overall result is clearly lower than the baseline (which was close to 1900).
errors <- rep(0,length(teststores))
j <- 1
for ( i in teststores ) {
errors[j] <- storelm2(i)
j <- j + 1
}
mean(errors)
## [1] 1347.681
Next I use a random forest model to generate predictions. For some reason that is not clear to me, I was not able to run R’s random forest function on the binary variables, but using the original numeric values for the date components worked fine and produced very good results with a tree count of 200. I tried increasing this value up to 500 but saw only a little improvement, while execution time grew significantly, so 200 represents a good compromise between quality and practicality. I abstracted the cross-validation procedure written above for linear regression (storelm) into generic functions (one for the numeric and one for the binary version of the data set) that I can reuse from now on with different models; the modeling function is passed to the numericmodel function as an argument.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
numericmodel <- function(storeNumber,modelType,...) {
store <- t[t$Store == storeNumber,]
store$Promo <- as.numeric(store$Promo)
store$StateHoliday <- as.numeric(store$StateHoliday)
store$SchoolHoliday <- as.numeric(store$SchoolHoliday)
store$DateWeek <- as.numeric(store$DateWeek)
store$DayOfWeek <- as.numeric(store$DayOfWeek)
shuffledIndices <- sample(nrow(store))
store$Prediction <- 0
z <- nrow(store)
for (i in 1:10) {
sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z)
test <- store[shuffledIndices[sampleIndex],]
train <- store[shuffledIndices[-sampleIndex],(2:(ncol(store)-1))]
modell <- modelType(Sales ~ ., train, ...)
store[shuffledIndices[sampleIndex],]$Prediction <- predict(modell,test)
}
sqrt(mean((store$Sales-store$Prediction)^2))
}
Results for stores 1 and 262 are clearly better than those that were attained using linear models. Execution takes a couple of seconds for each shop.
print(system.time(print(numericmodel(1,randomForest,ntree=200))))
## [1] 568.4792
## user system elapsed
## 2.34 0.03 2.39
print(system.time(print(numericmodel(262,randomForest,ntree=200))))
## [1] 1847.454
## user system elapsed
## 3.14 0.08 3.22
This is also executed for the whole data set in order to compare it to the linear model (and baseline) results.
errors <- rep(0,length(teststores))
j <- 1
for ( i in teststores ) {
errors[j] <- numericmodel(i,randomForest,ntree=200)
j <- j + 1
}
mean(errors)
## [1] 909.0821
max(errors)
## [1] 2569.596
The following code is the equivalent function for the binary data set. It produces similar (although slightly worse) results with random forests, but is much slower.
binarymodel <- function(storeNumber,modelType,...) {
store <- tb[tb$Store == storeNumber,2:ncol(tb)]
names(store) <- paste("V",c(1:ncol(store)),sep="") # renaming the column names to V1 through V108 because the current names cannot be used in formulae
names(store)[1] <- "Sales" # the target variable is still called 'Sales'
shuffledIndices <- sample(nrow(store))
store$Prediction <- 0
z <- nrow(store)
for (i in 1:10) {
sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z)
test <- store[shuffledIndices[sampleIndex],]
train <- store[shuffledIndices[-sampleIndex],c(1,3:(ncol(store)-1))]
modell <- modelType(Sales ~ ., train, ...)
store[shuffledIndices[sampleIndex],]$Prediction <- predict(modell,test)
}
sqrt(mean((store$Sales-store$Prediction)^2))
}
Predicting sales for a single store in this way takes about half a minute, so it is not reasonable to run it on the whole data set.
print(system.time(print(binarymodel(1,randomForest,ntree=200))))
## [1] 601.6405
## user system elapsed
## 25.62 0.06 25.68
I also experimented with a different approach to the training data using random forests. So far I have always divided up the training data by store and constructed the models separately for each store. The reasoning behind this procedure is that the store variable is by far the strongest predictor for the sales, which I have not tried to prove but which seems a very reasonable assumption looking at the descriptive statistics of the data set.
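One quick, informal way to gauge this (a sketch, not something used in the modeling itself) is to compare the spread of the per-store mean sales, computed earlier for the baseline, with the overall spread of the daily sales:
sd(means$x) # spread of the per-store mean sales (the means table was computed for the baseline above)
sd(t$Sales) # overall spread of daily sales across all stores and days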
Now I also tried the alternative approach of pooling all shops together and constructing a single model for all of them. The idea here is that changes in sales over time could follow strong general patterns which obviously cannot be established when we look at a single store but which could be captured when modeling the data for all stores in one go. This is clearly a much more resource-intensive solution: computing a model for a single store requires very little memory and is completed in a matter of seconds. Computing one for all stores at once requires gigabytes of memory and takes hours. Unfortunately this approach failed to yield acceptable results. When run on the first 100 stores with a forest size of 200, its execution takes hours but the RMSE for the predictions is slightly higher than the baseline.
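For illustration, a minimal sketch of this pooled approach is given below (this is not the exact code of the experiment; it assumes the columns of t as prepared above and restricts the data to the first 100 test stores). Note that the Store factor has to be converted to a numeric value, because randomForest cannot handle categorical predictors with more than 53 levels.
pooled <- t[t$Store %in% teststores[1:100],c("Store","Sales","Promo","StateHoliday","SchoolHoliday","DateYear","DateMonth","DateDay","DateWeek","DayOfWeek")]
pooled$Store <- as.numeric(as.character(pooled$Store)) # randomForest cannot handle factors with more than 53 levels
pooled$Promo <- as.numeric(pooled$Promo)
pooled$SchoolHoliday <- as.numeric(pooled$SchoolHoliday)
pooled$DayOfWeek <- as.numeric(pooled$DayOfWeek)
pooledForest <- randomForest(Sales ~ ., pooled, ntree=200) # very resource-intensive compared to the per-store models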
The neural network model in R functions differently from the previous two models, so the earlier functions must be modified. I am using the binary version of the data set to train neural network models, and the target variable must be scaled to a value between -1 and 1 for the network to function correctly. Since the binary values are already either 0 or 1, no transformation is required for them. The neuralnet function does not allow the use of the shorthand notation for the formula to be fitted, which is usually expressed as “Sales ~ .”, meaning use all variables to predict sales. Instead, all variables must be explicitly enumerated in the right hand side of the formula, which is achieved by concatenating the column names.
A loop is executed to try different parameter settings for the neural network model function. I am using a single hidden layer, looping over different settings for the number of neurons in this layer between 1 and 15. A second loop is executed within this loop that tries different values for the threshold value that is the stopping criterion for the fitting process. The smaller this value, the longer the fitting runs and the closer the model fits the training data. I have also experimented with larger figures for the number of neurons and smaller values for the thresholds, but these did not improve the RMSE score. As seen in the output for the below process, more neurons and lower thresholds gradually worsen the accuracy of the predictions, which is, I suppose, due to overfitting.
The below code uses the default algorithm for fitting neural networks with the neuralnet function, which is called resilient backpropagation with weight backtracking. I have also experimented with the other algorithms available, but these do not yield significantly different results. The results around 600 that were achieved with 1 neuron for the first store represent a perfectly respectable level of accuracy, about as good as the linear regression model and not much worse than the random forest. However, the neuralnet model also has a very significant drawback: fitting a network just to one store with a threshold of 0.1 (which usually yields acceptable accuracy) takes several seconds, whereas linear models and random forests are fitted almost instantly. This means that fitting models for all stores can take about 2 hours. Lowering the threshold to 0.01 seems to improve accuracy very slightly, but takes about 5 times as long to fit, and is thus not practicable.
library(neuralnet)
## Loading required package: grid
## Loading required package: MASS
storenn <- function(storeNumber,...) {
store <- tb[tb$Store == storeNumber,2:ncol(tb)]
names(store) <- paste("V",c(1:108),sep="") # renaming the column names to V1 through V108 because the current names cannot be used in formulae
names(store)[1] <- "Sales" # the target variable is still called 'Sales'
for (hid in c(1,3)) { # This specifies the number of neurons in the hidden layer
for (trs in c(0.5,0.3,0.1,0.03)) { # This is the threshold value for the partial derivatives of the error function used as stopping criteria
shuffledData <- store[sample(nrow(store)),]
shuffledData$Prediction <- 0
### Here the sales data are scaled to values between -1 and 1
m1 <- mean(shuffledData$Sales)
shuffledData$Sales <- shuffledData$Sales - m1
m2 <- max(abs(shuffledData$Sales))
shuffledData$Sales <- shuffledData$Sales / m2
z <- nrow(store)
for (i in 1:10) {
ind <- 1:z
sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
test <- shuffledData[sampleIndex,3:(ncol(shuffledData)-1)]
train <- shuffledData[-sampleIndex,c(1,3:(ncol(shuffledData)-1))]
n <- names(train)
f <- as.formula(paste("Sales ~", paste(n[n != "Sales"], collapse = " + ")))
modell <- neuralnet(f, train, hidden=hid,thresh=trs)
c <- compute(modell,test)
c$net.result <- c$net.result * m2 + m1
shuffledData[sampleIndex,]$Prediction <- c$net.result
}
shuffledData$Sales <- shuffledData$Sales * m2 + m1
print(system.time(print(paste("hidden=",hid,"thresh=",trs,"RMSE=",sqrt(mean((shuffledData$Sales-shuffledData$Prediction)^2))))))
}
}
}
storenn(1)
## [1] "hidden= 1 thresh= 0.5 RMSE= 631.150302011459"
## user system elapsed
## 0 0 0
## [1] "hidden= 1 thresh= 0.3 RMSE= 615.010357425616"
## user system elapsed
## 0 0 0
## [1] "hidden= 1 thresh= 0.1 RMSE= 653.620989227642"
## user system elapsed
## 0 0 0
## [1] "hidden= 1 thresh= 0.03 RMSE= 623.056264140328"
## user system elapsed
## 0 0 0
## [1] "hidden= 3 thresh= 0.5 RMSE= 666.2249655126"
## user system elapsed
## 0 0 0
## [1] "hidden= 3 thresh= 0.3 RMSE= 652.099525460725"
## user system elapsed
## 0 0 0
## [1] "hidden= 3 thresh= 0.1 RMSE= 632.071316702847"
## user system elapsed
## 0 0 0
## [1] "hidden= 3 thresh= 0.03 RMSE= 748.206699380426"
## user system elapsed
## 0 0 0
The total RMSE using neural networks for the whole data set is calculated with the following code. As expected, the results are not particularly satisfactory.
storenn2 <- function(storeNumber,...) {
store <- tb[tb$Store == storeNumber,2:ncol(tb)]
names(store) <- paste("V",c(1:108),sep="")
names(store)[1] <- "Sales"
shuffledData <- store[sample(nrow(store)),]
shuffledData$Prediction <- 0
m1 <- mean(shuffledData$Sales)
shuffledData$Sales <- shuffledData$Sales - m1
m2 <- max(abs(shuffledData$Sales))
shuffledData$Sales <- shuffledData$Sales / m2
z <- nrow(store)
for (i in 1:10) {
ind <- 1:z
sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
test <- shuffledData[sampleIndex,3:(ncol(shuffledData)-1)]
train <- shuffledData[-sampleIndex,c(1,3:(ncol(shuffledData)-1))]
n <- names(train)
f <- as.formula(paste("Sales ~", paste(n[n != "Sales"], collapse = " + ")))
modell <- neuralnet(f, train, ...)
c <- compute(modell,test)
c$net.result <- c$net.result * m2 + m1
shuffledData[sampleIndex,]$Prediction <- c$net.result
}
shuffledData$Sales <- shuffledData$Sales * m2 + m1
sqrt(mean((shuffledData$Sales-shuffledData$Prediction)^2))
}
errors <- rep(0,length(teststores))
j <- 1
for ( i in teststores ) {
errors[j] <- storenn2(i, hidden=1,thresh=0.1)
j <- j + 1
}
mean(errors)
max(errors)
R’s support vector machine implementation fitted to the binary data delivers results that are about as accurate, using the best combination of parameters (in the below case, cost = 30 and gamma = 0.001), as those generated by random forests. However, it runs much slower and takes roughly 5 to 10 seconds to converge.
library(e1071)
for (cst in c(30000,3000,300,30)) {
for (gm in c(0.1,0.01,0.001)) {
print(system.time(print(paste("cost=",cst,"gamma=",gm,"RMSE=",binarymodel(262,svm,cost=cst,gamma=gm)))))
}
}
## [1] "cost= 30000 gamma= 0.1 RMSE= 4635.56941656794"
## user system elapsed
## 8.24 0.08 8.33
## [1] "cost= 30000 gamma= 0.01 RMSE= 2329.39631254082"
## user system elapsed
## 7.60 0.06 7.69
## [1] "cost= 30000 gamma= 0.001 RMSE= 2008.14839687849"
## user system elapsed
## 12.06 0.06 12.12
## [1] "cost= 3000 gamma= 0.1 RMSE= 4639.68953226739"
## user system elapsed
## 8.27 0.04 8.30
## [1] "cost= 3000 gamma= 0.01 RMSE= 2330.94323499704"
## user system elapsed
## 7.53 0.07 7.61
## [1] "cost= 3000 gamma= 0.001 RMSE= 1952.40151632215"
## user system elapsed
## 11.93 0.03 11.98
## [1] "cost= 300 gamma= 0.1 RMSE= 4640.94882845657"
## user system elapsed
## 8.30 0.10 8.39
## [1] "cost= 300 gamma= 0.01 RMSE= 2337.58645068642"
## user system elapsed
## 7.62 0.01 7.63
## [1] "cost= 300 gamma= 0.001 RMSE= 1882.74394288866"
## user system elapsed
## 11.29 0.10 11.39
## [1] "cost= 30 gamma= 0.1 RMSE= 4633.19788001548"
## user system elapsed
## 8.26 0.08 8.35
## [1] "cost= 30 gamma= 0.01 RMSE= 2366.35393267501"
## user system elapsed
## 7.55 0.04 7.61
## [1] "cost= 30 gamma= 0.001 RMSE= 1827.67286903152"
## user system elapsed
## 7.78 0.11 7.89
This function is used in loops to establish the best combination of parameters. Unfortunately it turns out that the parameters that are ideal for some stores only yield average quality predictions for others and vice versa. For example, the below examples show that a gamma of 0.001 together with a C value between 10 and 100 results in the best predictions for store 262 (a store with high sales), whereas store 1 works best with C = 10000 and gamma = 0.03. Thus using these SVMs would probably require determining separate C and gamma values for each store and constructing the models for each store based on these different parameters (a sketch of what such per-store tuning might look like is given after the parameter scans below). However, the quality of the SVM-based predictions is not high enough to make this worthwhile. Furthermore, training for some stores is rather slow.
for (cst in c(10000,3000,1000,300,100,30,10)) {
for (gm in c(0.03,0.01,0.003,0.001)) {
print(system.time(print(paste("cost=",cst,"gamma=",gm,"RMSE=",binarymodel(262,svm,cost=cst,gamma=gm)))))
}
}
## [1] "cost= 10000 gamma= 0.03 RMSE= 4009.79422759253"
## user system elapsed
## 8.33 0.06 8.39
## [1] "cost= 10000 gamma= 0.01 RMSE= 2324.08524059366"
## user system elapsed
## 7.57 0.09 7.68
## [1] "cost= 10000 gamma= 0.003 RMSE= 1982.21652125073"
## user system elapsed
## 8.38 0.00 8.40
## [1] "cost= 10000 gamma= 0.001 RMSE= 1938.88806809072"
## user system elapsed
## 12.09 0.05 12.14
## [1] "cost= 3000 gamma= 0.03 RMSE= 3988.96976475929"
## user system elapsed
## 8.22 0.06 8.28
## [1] "cost= 3000 gamma= 0.01 RMSE= 2313.02805962884"
## user system elapsed
## 7.60 0.05 7.66
## [1] "cost= 3000 gamma= 0.003 RMSE= 1951.02806194563"
## user system elapsed
## 8.33 0.08 8.44
## [1] "cost= 3000 gamma= 0.001 RMSE= 1997.87421755011"
## user system elapsed
## 12.15 0.06 12.28
## [1] "cost= 1000 gamma= 0.03 RMSE= 4014.93475227269"
## user system elapsed
## 8.22 0.08 8.30
## [1] "cost= 1000 gamma= 0.01 RMSE= 2311.3030130797"
## user system elapsed
## 7.47 0.11 7.58
## [1] "cost= 1000 gamma= 0.003 RMSE= 1952.12609018553"
## user system elapsed
## 8.32 0.07 8.39
## [1] "cost= 1000 gamma= 0.001 RMSE= 1944.82550648585"
## user system elapsed
## 12.03 0.08 12.11
## [1] "cost= 300 gamma= 0.03 RMSE= 3999.97142692386"
## user system elapsed
## 8.22 0.03 8.27
## [1] "cost= 300 gamma= 0.01 RMSE= 2266.4608804984"
## user system elapsed
## 7.59 0.02 7.61
## [1] "cost= 300 gamma= 0.003 RMSE= 1918.97556971088"
## user system elapsed
## 8.27 0.06 8.33
## [1] "cost= 300 gamma= 0.001 RMSE= 1987.77042418125"
## user system elapsed
## 11.39 0.06 11.45
## [1] "cost= 100 gamma= 0.03 RMSE= 3975.40898905443"
## user system elapsed
## 8.19 0.08 8.26
## [1] "cost= 100 gamma= 0.01 RMSE= 2363.30100177786"
## user system elapsed
## 7.52 0.08 7.61
## [1] "cost= 100 gamma= 0.003 RMSE= 1953.59952188883"
## user system elapsed
## 8.31 0.11 8.42
## [1] "cost= 100 gamma= 0.001 RMSE= 1879.98943384919"
## user system elapsed
## 9.78 0.05 9.83
## [1] "cost= 30 gamma= 0.03 RMSE= 4027.71974430893"
## user system elapsed
## 8.24 0.07 8.32
## [1] "cost= 30 gamma= 0.01 RMSE= 2338.60640939273"
## user system elapsed
## 7.60 0.04 7.63
## [1] "cost= 30 gamma= 0.003 RMSE= 1959.45634773481"
## user system elapsed
## 8.22 0.07 8.31
## [1] "cost= 30 gamma= 0.001 RMSE= 1808.49837042223"
## user system elapsed
## 7.84 0.03 7.90
## [1] "cost= 10 gamma= 0.03 RMSE= 4032.16279240566"
## user system elapsed
## 8.22 0.08 8.30
## [1] "cost= 10 gamma= 0.01 RMSE= 2322.41934054244"
## user system elapsed
## 7.60 0.08 7.68
## [1] "cost= 10 gamma= 0.003 RMSE= 1908.31459364646"
## user system elapsed
## 7.63 0.05 7.70
## [1] "cost= 10 gamma= 0.001 RMSE= 1827.86932830604"
## user system elapsed
## 7.15 0.04 7.19
for (cst in c(10000,3000,1000,300,100,30,10)) {
for (gm in c(0.03,0.01,0.003,0.001)) {
print(system.time(print(paste("cost=",cst,"gamma=",gm,"RMSE=",binarymodel(1,svm,cost=cst,gamma=gm)))))
}
}
## [1] "cost= 10000 gamma= 0.03 RMSE= 581.142879580527"
## user system elapsed
## 4.48 0.02 4.49
## [1] "cost= 10000 gamma= 0.01 RMSE= 604.978584459808"
## user system elapsed
## 3.26 0.03 3.29
## [1] "cost= 10000 gamma= 0.003 RMSE= 635.546530810994"
## user system elapsed
## 3.06 0.05 3.10
## [1] "cost= 10000 gamma= 0.001 RMSE= 702.374448471912"
## user system elapsed
## 2.95 0.03 2.98
## [1] "cost= 3000 gamma= 0.03 RMSE= 590.872000276642"
## user system elapsed
## 3.22 0.03 3.26
## [1] "cost= 3000 gamma= 0.01 RMSE= 641.037405812469"
## user system elapsed
## 3.01 0.09 3.11
## [1] "cost= 3000 gamma= 0.003 RMSE= 718.242021226803"
## user system elapsed
## 2.90 0.03 2.93
## [1] "cost= 3000 gamma= 0.001 RMSE= 828.126905711493"
## user system elapsed
## 2.82 0.05 2.88
## [1] "cost= 1000 gamma= 0.03 RMSE= 643.119698670216"
## user system elapsed
## 3.04 0.05 3.09
## [1] "cost= 1000 gamma= 0.01 RMSE= 710.254417758907"
## user system elapsed
## 2.93 0.05 2.98
## [1] "cost= 1000 gamma= 0.003 RMSE= 833.632354371629"
## user system elapsed
## 2.84 0.01 2.86
## [1] "cost= 1000 gamma= 0.001 RMSE= 930.955592008881"
## user system elapsed
## 2.85 0.02 2.87
## [1] "cost= 300 gamma= 0.03 RMSE= 731.563947877744"
## user system elapsed
## 2.90 0.01 2.92
## [1] "cost= 300 gamma= 0.01 RMSE= 837.389158289238"
## user system elapsed
## 2.84 0.02 2.86
## [1] "cost= 300 gamma= 0.003 RMSE= 940.627565613757"
## user system elapsed
## 2.85 0.01 2.87
## [1] "cost= 300 gamma= 0.001 RMSE= 990.310851555043"
## user system elapsed
## 2.82 0.03 2.86
## [1] "cost= 100 gamma= 0.03 RMSE= 852.952630309809"
## user system elapsed
## 2.83 0.02 2.84
## [1] "cost= 100 gamma= 0.01 RMSE= 935.742781446738"
## user system elapsed
## 2.81 0.03 2.84
## [1] "cost= 100 gamma= 0.003 RMSE= 990.364799384578"
## user system elapsed
## 2.86 0.00 2.85
## [1] "cost= 100 gamma= 0.001 RMSE= 1010.5041221811"
## user system elapsed
## 2.78 0.08 2.86
## [1] "cost= 30 gamma= 0.03 RMSE= 954.632985113189"
## user system elapsed
## 2.83 0.03 2.86
## [1] "cost= 30 gamma= 0.01 RMSE= 992.010400651147"
## user system elapsed
## 2.81 0.05 2.85
## [1] "cost= 30 gamma= 0.003 RMSE= 1009.40403678135"
## user system elapsed
## 2.81 0.04 2.85
## [1] "cost= 30 gamma= 0.001 RMSE= 1016.47614714052"
## user system elapsed
## 2.81 0.04 2.84
## [1] "cost= 10 gamma= 0.03 RMSE= 995.431099321153"
## user system elapsed
## 2.84 0.01 2.86
## [1] "cost= 10 gamma= 0.01 RMSE= 1009.68111754956"
## user system elapsed
## 2.77 0.06 2.84
## [1] "cost= 10 gamma= 0.003 RMSE= 1015.92927942376"
## user system elapsed
## 2.80 0.03 2.84
## [1] "cost= 10 gamma= 0.001 RMSE= 1018.15585025683"
## user system elapsed
## 2.80 0.04 2.84
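If per-store tuning were pursued despite these reservations, e1071's built-in grid search could be used. The following is only a sketch of what such a step might look like for a single store; the parameter grids are chosen to mirror the loops above.
store <- tb[tb$Store == 262,2:ncol(tb)]
names(store) <- paste("V",c(1:ncol(store)),sep="")
names(store)[1] <- "Sales"
store <- store[,-2] # drop the constant intercept column
tuned <- tune.svm(Sales ~ ., data=store, gamma=c(0.01,0.003,0.001), cost=c(10,30,100)) # 10-fold cross-validated grid search
tuned$best.parameters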
Although the SVM algorithm does not converge on the numeric version of the training data, I have experimented with a scaled version of these data using the code below.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
scaledmodel <- function(storeNumber,modelType,...) {
store <- t[t$Store == storeNumber,2:10]
shuffledData <- store[sample(nrow(store)),]
shuffledData$Prediction <- 0
m1 <- mean(shuffledData$Sales)
shuffledData$Sales <- shuffledData$Sales - m1
m2 <- max(abs(shuffledData$Sales))
shuffledData$Sales <- shuffledData$Sales / m2
pp <- preProcess(store[-1])
shuffledData <- predict(pp,shuffledData)
z <- nrow(store)
for (i in 1:10) {
ind <- 1:z
sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
test <- shuffledData[sampleIndex,3:(ncol(shuffledData)-1)]
train <- shuffledData[-sampleIndex,c(1,3:(ncol(shuffledData)-1))]
n <- names(train)
f <- as.formula(paste("Sales ~", paste(n[n != "Sales"], collapse = " + ")))
modell <- modelType(f, train, ...)
shuffledData[sampleIndex,]$Prediction <- predict(modell,test) * m2 + m1
}
shuffledData$Sales <- shuffledData$Sales * m2 + m1
sqrt(mean((shuffledData$Sales-shuffledData$Prediction)^2))
}
print(system.time(print(scaledmodel(1,svm))))
## [1] 891.551692
## user system elapsed
## 1.84 0.02 1.86
This does in fact converge and even helped to reduce the execution time for SVMs to around 3 seconds, but the predictions are much less accurate than with the binary data. I have used smaller C values in this case than with the binary model because when I tried values as high as 10000, the algorithm took extremely long (as long as 8 minutes depending on the gamma value) to converge while yielding no better results.
for (cst in c(300,100,30,10,1)) {
for (gm in c(0.1,0.03,0.01,0.003,0.001)) {
print(system.time(print(paste("cost=",cst,"gamma=",gm,"RMSE=",scaledmodel(262,svm,cost=cst,gamma=gm)))))
}
}
## [1] "cost= 300 gamma= 0.1 RMSE= 2593.64232101338"
## user system elapsed
## 32.61 0.00 32.60
## [1] "cost= 300 gamma= 0.03 RMSE= 2296.93245132956"
## user system elapsed
## 10.96 0.02 10.99
## [1] "cost= 300 gamma= 0.01 RMSE= 2340.51431946316"
## user system elapsed
## 5.07 0.02 5.09
## [1] "cost= 300 gamma= 0.003 RMSE= 2450.27204333763"
## user system elapsed
## 3.23 0.00 3.23
## [1] "cost= 300 gamma= 0.001 RMSE= 2514.59037761978"
## user system elapsed
## 2.68 0.00 2.70
## [1] "cost= 100 gamma= 0.1 RMSE= 2475.16858490888"
## user system elapsed
## 11.50 0.01 11.51
## [1] "cost= 100 gamma= 0.03 RMSE= 2280.36807263334"
## user system elapsed
## 5.32 0.02 5.33
## [1] "cost= 100 gamma= 0.01 RMSE= 2347.97891652388"
## user system elapsed
## 3.38 0.00 3.39
## [1] "cost= 100 gamma= 0.003 RMSE= 2483.69624430045"
## user system elapsed
## 2.64 0.00 2.63
## [1] "cost= 100 gamma= 0.001 RMSE= 2548.64205748948"
## user system elapsed
## 2.61 0.00 2.61
## [1] "cost= 30 gamma= 0.1 RMSE= 2316.25440194878"
## user system elapsed
## 5.12 0.00 5.14
## [1] "cost= 30 gamma= 0.03 RMSE= 2264.54892406434"
## user system elapsed
## 3.23 0.00 3.23
## [1] "cost= 30 gamma= 0.01 RMSE= 2407.45275042331"
## user system elapsed
## 2.66 0.00 2.66
## [1] "cost= 30 gamma= 0.003 RMSE= 2510.30074596213"
## user system elapsed
## 2.50 0.01 2.52
## [1] "cost= 30 gamma= 0.001 RMSE= 2805.74003619912"
## user system elapsed
## 2.45 0.00 2.46
## [1] "cost= 10 gamma= 0.1 RMSE= 2360.49558725221"
## user system elapsed
## 3.23 0.03 3.28
## [1] "cost= 10 gamma= 0.03 RMSE= 2306.94297156483"
## user system elapsed
## 2.67 0.04 2.70
## [1] "cost= 10 gamma= 0.01 RMSE= 2495.09428331857"
## user system elapsed
## 2.51 0.00 2.51
## [1] "cost= 10 gamma= 0.003 RMSE= 2756.7848053233"
## user system elapsed
## 2.45 0.00 2.45
## [1] "cost= 10 gamma= 0.001 RMSE= 3121.15195949112"
## user system elapsed
## 2.36 0.00 2.35
## [1] "cost= 1 gamma= 0.1 RMSE= 2404.14928086344"
## user system elapsed
## 2.53 0.00 2.53
## [1] "cost= 1 gamma= 0.03 RMSE= 2727.48736148838"
## user system elapsed
## 2.39 0.00 2.38
## [1] "cost= 1 gamma= 0.01 RMSE= 3104.95917566333"
## user system elapsed
## 2.32 0.00 2.33
## [1] "cost= 1 gamma= 0.003 RMSE= 3846.94688442779"
## user system elapsed
## 2.36 0.00 2.36
## [1] "cost= 1 gamma= 0.001 RMSE= 4404.55866863145"
## user system elapsed
## 2.37 0.01 2.39
Therefore I will only use the basic binary model function to train SVMs.
The above models used the default kernel for SVMs in R, which is the radial basis kernel. I have also experimented with the other available kernels. Reasonably good predictions are generated using linear kernels when C is set relatively low (to 1 or below), but execution time becomes prohibitively long as C approaches 1 or goes beyond it.
for (cst in c(1,0.5,0.1,0.05,0.01,0.003,0.001)) {
print(system.time(print(paste("cost=",cst,"RMSE=",binarymodel(262,svm,cost=cst,kernel="linear")))))
}
## [1] "cost= 1 RMSE= 1844.65110177669"
## user system elapsed
## 41.03 0.13 41.15
## [1] "cost= 0.5 RMSE= 1862.54272709356"
## user system elapsed
## 23.26 0.06 23.33
## [1] "cost= 0.1 RMSE= 1850.25702287523"
## user system elapsed
## 9.02 0.08 9.11
## [1] "cost= 0.05 RMSE= 1839.61052805445"
## user system elapsed
## 7.13 0.04 7.22
## [1] "cost= 0.01 RMSE= 1854.77693246435"
## user system elapsed
## 5.72 0.05 5.77
## [1] "cost= 0.003 RMSE= 2184.88577928492"
## user system elapsed
## 5.60 0.06 5.69
## [1] "cost= 0.001 RMSE= 3112.03226993571"
## user system elapsed
## 5.55 0.05 5.61
For other stores, however, predictions based on linear kernels are worse than average.
for (cst in c(1,0.5,0.1,0.05,0.01,0.003,0.001)) {
print(system.time(print(paste("cost=",cst,"RMSE=",binarymodel(1,svm,cost=cst,kernel="linear")))))
}
## [1] "cost= 1 RMSE= 970.148233347503"
## user system elapsed
## 1.61 0.03 1.66
## [1] "cost= 0.5 RMSE= 995.299056098023"
## user system elapsed
## 1.60 0.05 1.66
## [1] "cost= 0.1 RMSE= 1014.28182471904"
## user system elapsed
## 1.59 0.06 1.66
## [1] "cost= 0.05 RMSE= 1017.41535968184"
## user system elapsed
## 1.61 0.02 1.62
## [1] "cost= 0.01 RMSE= 1019.64941504274"
## user system elapsed
## 1.59 0.06 1.67
## [1] "cost= 0.003 RMSE= 1021.03898011506"
## user system elapsed
## 1.56 0.08 1.64
## [1] "cost= 0.001 RMSE= 1019.74808959326"
## user system elapsed
## 1.59 0.04 1.63
I have not been able to achieve competitive results with polynomial kernels under any settings, so I will not discuss them here.
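For completeness, such an experiment would simply pass the corresponding kernel and its parameters to the same cross-validation function, for example (a sketch; the parameter values are illustrative only):
binarymodel(1,svm,kernel="polynomial",degree=2,cost=10)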
As I mentioned when presenting an overview of the data, there is a second data table called store.csv which contains information on the type of store, whether there is a competition store close to it, and whether the store participates in a promotion programme. This is a typical row from that table.
s[1,]
## Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth
## 1 1 c a 1270 9
## CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear
## 1 2008 FALSE NA NA
## PromoInterval
## 1
I assume that certain pieces of information included here are redundant and less specific than the data available from the training table. No doubt the StoreType, CompetitionDistance and Assortment variables have an effect on the sales of a store. Theoretically, these could be used to impute sales for stores that we have little historical data on, but this will not be necessary since we seem to have sufficient historical data for all stores. The opening date of the competition store and the start of the second type of promo might have some effect on sales. In order to get an impression of whether this effect indeed exists, I create new logical variables for the existence of a competition store and for participation in the second promotion respectively, and set them to TRUE for all dates after the starting dates indicated in the stores table.
t$Competition <- F
t$Promo2 <- F
for (i in 1:nrow(s)) {
if (!is.na(s[i,5])) { # column 5: CompetitionOpenSinceMonth, column 6: CompetitionOpenSinceYear
m <- s[i,5]
y <- s[i,6] - 2000
t[t$Store == i,]$Competition <- t[t$Store == i,]$DateYear > y | (t[t$Store == i,]$DateMonth >= m & t[t$Store == i,]$DateYear == y)
}
if (s[i,7]) { # column 7: Promo2, column 8: Promo2SinceWeek, column 9: Promo2SinceYear
w <- s[i,8]
y <- s[i,9] - 2000
t[t$Store == i,]$Promo2 <- t[t$Store == i,]$DateYear > y | (t[t$Store == i,]$DateWeek >= w & t[t$Store == i,]$DateYear == y)
}
}
Next I check how these two variables affect sales.
t.test(t[t$Promo2,]$Sales,t[!t$Promo2,]$Sales)
##
## Welch Two Sample t-test
##
## data: t[t$Promo2, ]$Sales and t[!t$Promo2, ]$Sales
## t = -81.052822, df = 633758.63, p-value < 0.00000000000000022204
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -605.1260330 -576.5514605
## sample estimates:
## mean of x mean of y
## 6548.138444 7138.977191
Unexpectedly, participation of the store in the second type of promotion is correlated with markedly lower sales (by about 8 per cent). The difference is highly significant.
t.test(t[t$Competition,]$Sales,t[!t$Competition,]$Sales)
##
## Welch Two Sample t-test
##
## data: t[t$Competition, ]$Sales and t[!t$Competition, ]$Sales
## t = -24.192057, df = 597683.09, p-value < 0.00000000000000022204
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -195.0011589 -165.7723695
## sample estimates:
## mean of x mean of y
## 6774.203593 6954.590358
Of our binary variables, Competition correlates least strongly with differences in sales (the t value is closest to 0 here). Although the effect of sales being lower in the presence of competition is statistically highly significant, the magnitude of the difference is small (about 2.5 per cent).
Having added these further variables to our training data, I test the predictions for the usual two stores using both the linear model and the random forest functions.
storelm2(1)
## [1] 602.7654652
storelm2(262)
## [1] 2863.729623
numericmodel(1,randomForest,ntree=200)
## [1] 649.189655
numericmodel(262,randomForest,ntree=200)
## [1] 2081.333607
I retest the predictions for all stores using random forests. Since the cross-validation folds are shuffled randomly, individual runs vary, so the comparison with the earlier run is not exact; the average error per store remains well below that of the linear models.
errors <- rep(0,length(teststores))
j <- 1
for ( i in teststores ) {
errors[j] <- numericmodel(i,randomForest,ntree=200)
j <- j + 1
}
mean(errors)
## [1] 1011.349904
max(errors)
## [1] 2749.477975
Since these two further variables could still prove useful, I include them in the binary version of the data as well.
tb$Promo2TRUE <- as.numeric(t$Promo2)
tb$CompetitionTRUE <- as.numeric(t$Competition)
It is worth exploring whether PCA could improve accuracy or at least processing speed. Let us first examine the contribution that the principal components make to the data, using the binary version of the data set, which has 108 predictive variables; the rank of the data matrix is 107.
store <- tb[tb$Store == 262,4:ncol(tb)]
ncol(store)
## [1] 108
library(Matrix)
qr(store)$rank
## [1] 107
Above I have already selected a particular store (262) for the PCA. First I plot the standard deviations that the ordered principal components account for in the data.
pc <- prcomp(store)
plot(pc$sdev,type="l", xlab="Principal Components", ylab="Standard Deviations")
The standard deviations of the first four components are far higher than those of all the others, and they fall sharply until around component 20. From there they remain relatively flat up to about component 80, where they start to drop again.
Second, I compute the cumulative sum of these standard deviations, divide it by their total sum, and plot the result.
plot(cumsum(pc$sdev / sum(pc$sdev)),type="l", xlab="Principal Components", ylab="Cumulative Standard Deviation Ratio")
This plot shows an almost linear curve: the cumulative contribution of the components rises steadily and only flattens out towards the end. There is thus no obvious cut-off that suggests itself for dimensionality reduction, and we are unlikely to reduce the dimensionality of the data, and thereby gain processing speed, without sacrificing accuracy, since essentially all but the last few components add to the cumulative curve. Even though the first 20 or so components should suffice for rough estimates of sales, we should see a gradual but tangible improvement in prediction accuracy all the way up to the last components. To confirm this, I prepare a version of the storelm2 function that I used above to predict sales with a linear model. The new version incorporates PCA with dimension reduction; its second argument is the dimension to which the data are reduced.
lmwithpca <- function(storeNumber,dim) {
  store <- tb[tb$Store == storeNumber,2:ncol(tb)] # the store number factor variable is removed since it is not needed
  shuffledData <- store[sample(nrow(store)),]
  shuffledData$Prediction <- 0
  z <- nrow(store)
  for (i in 1:10) {
    ind <- 1:z
    sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
    test <- shuffledData[sampleIndex,]
    train <- shuffledData[-sampleIndex,1:(ncol(shuffledData)-1)]
    trank <- qr(train[3:ncol(train)])$rank
    pc <- prcomp(train[3:ncol(train)])
    # The dimension is reduced to dim, but if the rank of the training data is lower,
    # it is reduced to that rank instead. This takes care of the rank-deficient data
    # problem that arose in earlier versions of this function.
    pctrain <- cbind(train[1],pc$x)[1:min(trank,dim)]
    pctest <- as.data.frame(predict(pc,test[3:ncol(test)-1]))[1:min(trank,dim)]
    modell <- lm(Sales ~ ., pctrain)
    shuffledData[sampleIndex,]$Prediction <- predict(modell,pctest)
  }
  sqrt(mean((shuffledData$Sales-shuffledData$Prediction)^2))
}
I first calculate a baseline RMSE using the original function without PCA. I then run the new PCA-based function with the dimension set from 9 to 99 in steps of 10 and plot the resulting RMSE scores.
print(system.time(print(baseline <- storelm2(262))))
## [1] 2500.464874
## user system elapsed
## 0.82 0.00 0.82
results <- rep(1,10)
for (i in seq(9,99,10)) {
  print(system.time(print(results[(i+1)/10] <- lmwithpca(262,i))))
}
## [1] 3553.404564
## user system elapsed
## 1.15 0.03 1.20
## [1] 2901.212728
## user system elapsed
## 1.15 0.03 1.20
## [1] 2755.764757
## user system elapsed
## 1.28 0.04 1.36
## [1] 2791.883354
## user system elapsed
## 1.14 0.08 1.22
## [1] 2802.954744
## user system elapsed
## 1.17 0.09 1.26
## [1] 2815.983384
## user system elapsed
## 1.27 0.02 1.29
## [1] 2803.769114
## user system elapsed
## 1.27 0.08 1.34
## [1] 2700.634346
## user system elapsed
## 1.33 0.06 1.40
## [1] 2322.162164
## user system elapsed
## 1.34 0.05 1.38
## [1] 1911.203792
## user system elapsed
## 1.40 0.04 1.46
plot(seq(9,99,10),results)
points(c(9,99),c(baseline,baseline),"l",col="green")
Although the process is randomised and results differ slightly between runs, a typical outcome is a curve whose RMSE starts out very high with 9 principal components, drops sharply at 19, stays roughly level up to 79, and then drops sharply again at 89 and 99. I have also checked values from 99 to 107, but there is no further consistent gain in that range.
What is interesting, however, is the position of the baseline relative to the PCA-based results. The flat part of the curve runs at around the same height as, and usually slightly below, the baseline, whereas the high-dimensional approximations of the data (89 and 99 components) are far better than the baseline.
Therefore I construct an abstract function that first transforms the data with PCA and then trains an arbitrary model on the result to generate predictions for a store.
modelwithpca <- function(storeNumber,modelType,dim=100,...) {
  store <- tb[tb$Store == storeNumber,2:ncol(tb)] # the store number factor variable is removed since it is not needed
  shuffledData <- store[sample(nrow(store)),]
  shuffledData$Prediction <- 0
  z <- nrow(store)
  for (i in 1:10) {
    ind <- 1:z
    sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
    test <- shuffledData[sampleIndex,]
    train <- shuffledData[-sampleIndex,1:(ncol(shuffledData)-1)]
    trank <- qr(train[3:ncol(train)])$rank
    pc <- prcomp(train[3:ncol(train)])
    pctrain <- cbind(train[1],pc$x)[1:min(trank,dim)]
    pctest <- as.data.frame(predict(pc,test[3:ncol(test)-1]))[1:min(trank,dim)]
    modell <- modelType(Sales ~ ., pctrain,...)
    shuffledData[sampleIndex,]$Prediction <- predict(modell,pctest)
  }
  sqrt(mean((shuffledData$Sales-shuffledData$Prediction)^2))
}
Using this function, the errors for the complete range of stores could be calculated as we did above, although with a linear model as the underlying learner the results would be good but not exceptional.
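As a sketch (not executed here), the per-store errors could be collected in the same way as the earlier loop over teststores; any other model function could be substituted for lm.
errors <- sapply(teststores, function(i) modelwithpca(i, lm, dim = 100))
mean(errors)
max(errors)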
The function could in principle be used with random forests, but this does not yield good results: the binary data deliver somewhat worse error scores than the excellent (and so far best) results obtained with random forests on the numeric data. Running the function also takes an extremely long time despite the mediocre results.
print(system.time(print(modelwithpca(262,randomForest,100,ntree=100))))
## [1] 2252.886994
## user system elapsed
## 30.25 0.03 30.73
I have also created a modified version of the PCA-based function that uses the numeric data. Combined with random forests, however, it delivers far worse results than the plain numericmodel function without PCA.
numericwithpca <- function(storeNumber,modelType,dim=10,...) {
  store <- t[t$Store == storeNumber,2:ncol(t)] # the store number factor variable is removed since it is not needed
  store$Promo <- as.numeric(store$Promo)
  store$StateHoliday <- as.numeric(store$StateHoliday)
  store$SchoolHoliday <- as.numeric(store$SchoolHoliday)
  store$DateWeek <- as.numeric(store$DateWeek)
  store$DayOfWeek <- as.numeric(store$DayOfWeek)
  shuffledData <- store[sample(nrow(store)),]
  shuffledData$Prediction <- 0
  z <- nrow(store)
  for (i in 1:10) {
    ind <- 1:z
    sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
    test <- shuffledData[sampleIndex,2:(ncol(shuffledData)-1)]
    train <- shuffledData[-sampleIndex,1:(ncol(shuffledData)-1)]
    trank <- qr(train[2:ncol(train)])$rank
    pc <- prcomp(train[2:ncol(train)])
    pctrain <- cbind(train[1],pc$x)[1:min(trank,dim)]
    pctest <- as.data.frame(predict(pc,test))[1:min(trank,dim)]
    modell <- modelType(Sales ~ ., pctrain, ...) # pass model parameters such as ntree through to the model function
    shuffledData[sampleIndex,]$Prediction <- predict(modell,pctest)
  }
  sqrt(mean((shuffledData$Sales-shuffledData$Prediction)^2))
}
numericwithpca(262,randomForest,ntree=200)
## [1] 2404.263708
numericmodel(262,randomForest,ntree=200)
## [1] 2157.920224
At the same time, transforming the binary data with PCA significantly improves the SVM results compared with those obtained from the binarymodel function.
for (cst in c(300,100,30,10)) {
  for (gm in c(0.003,0.001)) {
    print(system.time(print(paste("cost=",cst,"gamma=",gm,"RMSE=",modelwithpca(1,svm,cost=cst,gamma=gm)))))
  }
}
## [1] "cost= 300 gamma= 0.003 RMSE= 538.169490760043"
## user system elapsed
## 7.52 0.12 7.76
## [1] "cost= 300 gamma= 0.001 RMSE= 553.180007040596"
## user system elapsed
## 10.17 0.09 10.32
## [1] "cost= 100 gamma= 0.003 RMSE= 537.645840475199"
## user system elapsed
## 7.64 0.07 7.85
## [1] "cost= 100 gamma= 0.001 RMSE= 557.760186938175"
## user system elapsed
## 8.44 0.04 8.67
## [1] "cost= 30 gamma= 0.003 RMSE= 537.796729867815"
## user system elapsed
## 7.31 0.03 7.47
## [1] "cost= 30 gamma= 0.001 RMSE= 589.611565644862"
## user system elapsed
## 7.01 0.05 7.14
## [1] "cost= 10 gamma= 0.003 RMSE= 546.482623881485"
## user system elapsed
## 6.67 0.03 6.76
## [1] "cost= 10 gamma= 0.001 RMSE= 610.304247978315"
## user system elapsed
## 6.19 0.00 6.19
for (cst in c(300,100,30,10)) {
  for (gm in c(0.003,0.001)) {
    print(system.time(print(paste("cost=",cst,"gamma=",gm,"RMSE=",modelwithpca(262,svm,cost=cst,gamma=gm)))))
  }
}
## [1] "cost= 300 gamma= 0.003 RMSE= 1852.57624384181"
## user system elapsed
## 8.97 0.10 9.20
## [1] "cost= 300 gamma= 0.001 RMSE= 1826.00936487394"
## user system elapsed
## 12.42 0.06 12.80
## [1] "cost= 100 gamma= 0.003 RMSE= 1906.82882216824"
## user system elapsed
## 9.69 0.02 10.03
## [1] "cost= 100 gamma= 0.001 RMSE= 1803.28632013106"
## user system elapsed
## 10.92 0.03 11.12
## [1] "cost= 30 gamma= 0.003 RMSE= 1877.85988409389"
## user system elapsed
## 9.06 0.02 9.37
## [1] "cost= 30 gamma= 0.001 RMSE= 1827.68605412662"
## user system elapsed
## 9.28 0.04 9.56
## [1] "cost= 10 gamma= 0.003 RMSE= 1811.04605728556"
## user system elapsed
## 8.47 0.02 8.61
## [1] "cost= 10 gamma= 0.001 RMSE= 1786.56345958342"
## user system elapsed
## 8.18 0.04 8.41
We have established that two methods deliver the best results with cross-validation on the training data set: random forests fitted to the numeric data, and SVMs fitted to the PCA-transformed binary data.
However, both procedures involve randomness (the random forest algorithm itself and the shuffling of the cross-validation folds), so one should make sure to generate results that are consistently as good as possible. For this reason I will combine predictions generated with these models. First, I will fit m = 5 random forests to each shop's training data and generate predictions with them; then I will fit n = 5 SVMs to the same data. I will then check the RMSEs of each of these models and establish the best way to combine them (e.g. take the median or the mean of all models, take the model that does best on the training data, etc.).
combineRFForStore <- function(storeNumber,m=5) {
  storen <- t[t$Store == storeNumber,2:ncol(t)]
  storen$Promo <- as.numeric(storen$Promo)
  storen$StateHoliday <- as.numeric(storen$StateHoliday)
  storen$SchoolHoliday <- as.numeric(storen$SchoolHoliday)
  storen$DateWeek <- as.numeric(storen$DateWeek)
  storen$DayOfWeek <- as.numeric(storen$DayOfWeek)
  salesAndPredictions <- storen[1]
  shuffledData <- storen[sample(nrow(storen)),]
  z <- nrow(storen)
  shuffledData$Prediction <- 0
  shuffledData$Residual <- 0
  shuffledData$Error <- 0
  for (k in 1:m) {
    for (i in 1:10) {
      ind <- 1:z
      sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
      test <- shuffledData[sampleIndex,]
      train <- shuffledData[-sampleIndex,(1:(ncol(shuffledData)-3))]
      modell <- randomForest(Sales ~ ., train, ntree=200)
      shuffledData[sampleIndex,]$Prediction <- predict(modell,test)
      shuffledData[sampleIndex,]$Residual <- mean(sqrt(modell$mse))
      shuffledData[sampleIndex,]$Error <- abs(shuffledData[sampleIndex,]$Sales - shuffledData[sampleIndex,]$Prediction)
    }
    salesAndPredictions <- cbind(salesAndPredictions,shuffledData[order(as.numeric(row.names(shuffledData))),(ncol(shuffledData)-2) : ncol(shuffledData)])
    names(salesAndPredictions)[(ncol(salesAndPredictions)-2):ncol(salesAndPredictions)] <- c(paste0("P",k),paste0("R",k),paste0("E",k))
    shuffledData <- shuffledData[sample(nrow(shuffledData)),]
  }
  salesAndPredictions
}
Calling the above function with a store number gives us a table whose columns contain, for each of the 5 cross-validation passes, the predictions, the residual error of the corresponding model, and the prediction error (i.e. the distance between sales and prediction in each row).
sp <- combineRFForStore(1)
head(sp)
## Sales P1 R1 E1 P2 R2 E2
## 1 5263 FALSE 562.5627345 824.73022616 FALSE 555.2929919 455.8934799
## 1116 5020 FALSE 571.2980274 30.53735213 FALSE 571.9360165 105.1318442
## 2231 4782 FALSE 587.5445638 402.31782397 FALSE 571.9360165 681.3112589
## 3346 5011 FALSE 548.7851509 119.38346933 FALSE 580.1206878 234.6015614
## 4461 6102 FALSE 580.5355484 729.79520106 FALSE 585.7268193 763.1129744
## 6691 4364 FALSE 571.2980274 220.77352842 FALSE 580.1206878 203.6818098
## P3 R3 E3 P4 R4 E4 P5
## 1 FALSE 556.9734689 455.56101592 FALSE 585.9847594 592.0242026 FALSE
## 1116 FALSE 556.9734689 80.09322094 FALSE 557.1926099 274.1452736 FALSE
## 2231 FALSE 592.3479034 745.17798335 FALSE 557.1926099 766.1443440 FALSE
## 3346 FALSE 592.3479034 392.48673976 FALSE 557.1926099 466.1901211 FALSE
## 4461 FALSE 594.4332079 774.47792923 FALSE 558.3101120 741.1280745 FALSE
## 6691 FALSE 576.0982377 179.47072841 FALSE 575.8490905 248.3175665 FALSE
## R5 E5
## 1 587.2957511 518.5154671
## 1116 589.5290878 346.5003603
## 2231 587.2957511 608.5533910
## 3346 594.0160907 218.2420557
## 4461 594.0160907 841.2543422
## 6691 594.0160907 137.7047735
First I investigate whether models should be preferred based on their mean residual error on the data they were trained on. To do this, I combine the residual errors of all 50 models into a single column and the prediction errors into another column, square the latter, and then calculate the mean squared prediction error for each residual error value.
errorsVsResiduals <- cbind(c(sp$R1,sp$R2,sp$R3,sp$R4,sp$R5),c(sp$E1,sp$E2,sp$E3,sp$E4,sp$E5))
errorsVsResiduals[,2] <- errorsVsResiduals[,2]^2
ag <- aggregate(errorsVsResiduals[,2],list(errorsVsResiduals[,1]),mean)
ag[1:10,]
## Group.1 x
## 1 533.2287718 831471.1400
## 2 545.8299409 609595.4572
## 3 548.7851509 463936.7054
## 4 553.3918313 330278.6786
## 5 555.2929919 419684.6768
## 6 556.9734689 627015.5655
## 7 557.1926099 523248.6238
## 8 558.3101120 516543.2643
## 9 562.5627345 579948.4461
## 10 564.6136231 391730.6034
plot(ag[,1],sqrt(ag[,2]),xlab="Training Residual Error",ylab="Prediction RMSE")
names(ag) <- c("Resid","MSE")
l <- lm(sqrt(MSE) ~ Resid, ag)
points(c(min(ag$Resid),max(ag$Resid)),predict(l,data.frame(Resid=c(min(ag$Resid),max(ag$Resid)))),type="l",col="green")
I have plotted the prediction root mean squared errors (vertical axis) against the residual errors of the models that generated the corresponding predictions (horizontal axis). There is no clear tendency visible in the raw points: the residual errors are concentrated between 570 and 590, but the corresponding prediction RMSEs are spread fairly evenly between 450 and 650, and very low RMSEs occur with residual errors as low as 575 and as high as 590. I have also fitted a linear model to these data and plotted it (the green line). It shows a clear drop in prediction RMSE as the residual error increases. This seems counterintuitive at first but is not completely unexpected: a lower residual error is likely a sign of overfitting to the training data and therefore correlates with lower accuracy on the test data. If anything, it suggests that preference should be given to models with a relatively high residual error. I have repeated this trial several times, both on the same store and on others, and the pattern was always the same.
As mentioned before, another approach is to investigate how the five predictions generated for each row of the training table can be combined. I try several possible rules: take the mean, the harmonic mean, the geometric mean, the median, the minimum or the maximum of the five predictions, or choose one of them at random. I then calculate RMSE scores for each of the resulting derived predictions.
calculateErrors <- function(sp) {
  sps <- sp[1] # initialize sps with the Sales column from sp
  predCols <- seq(2,ncol(sp)-2,3)  # the P1, P2, ... prediction columns
  residCols <- seq(3,ncol(sp)-1,3) # the R1, R2, ... residual columns
  sps$Mean <- apply(sp[,predCols],1,mean)
  sps$HarmonicMean <- apply(sp[,predCols],1,function(x) {1/mean(1/x)})
  sps$GeometricMean <- apply(sp[,predCols],1,function(x) {prod(x)^(1/length(x))})
  sps$Median <- apply(sp[,predCols],1,median)
  sps$Min <- apply(sp[,predCols],1,min)
  sps$Max <- apply(sp[,predCols],1,max)
  sps$Random <- apply(sp[,predCols],1,function(x) {x[sample(1:length(x),1)]})
  sps$RMax <- apply(sp[,residCols],1,max)
  # index of the prediction column belonging to the model with the worst (highest) residual error
  sps$RMaxIndex <- apply(cbind(sp[,residCols],sps[ncol(sps)]),1,function(x){2+((which(x == x[length(x)])[1])-1)*3})
  sps$WorstResidual <- apply(cbind(sp,sps[ncol(sps)]),1,function(x){x[x[length(x)]]})
  errors <- rep(0,8)
  names(errors) = c("Mean","HarmonicMean","GeometricMean","Median","Min","Max","Random","WorstResidual")
  errors[1] <- sqrt(mean((sps$Sales-sps$Mean)^2))
  errors[2] <- sqrt(mean((sps$Sales-sps$HarmonicMean)^2))
  errors[3] <- sqrt(mean((sps$Sales-sps$GeometricMean)^2))
  errors[4] <- sqrt(mean((sps$Sales-sps$Median)^2))
  errors[5] <- sqrt(mean((sps$Sales-sps$Min)^2))
  errors[6] <- sqrt(mean((sps$Sales-sps$Max)^2))
  errors[7] <- sqrt(mean((sps$Sales-sps$Random)^2))
  errors[8] <- sqrt(mean((sps$Sales-sps$WorstResidual)^2))
  errors
}
calculateErrors(sp)
## Mean HarmonicMean GeometricMean Median Min
## 4865.392356 4865.392356 4865.392356 4865.392356 4865.392356
## Max Random WorstResidual
## 4865.392356 4865.392356 4865.392356
Choosing a random value amounts to generating a single prediction without any combination, so it can serve as a baseline. The three mean variants are usually better than the random choice and roughly equal to one another. Always choosing the minimum or the maximum usually leads to the worst results, the minimum typically being worse than the maximum, although this may depend on the selected store. Choosing the median of the available predictions consistently yields the best result. Choosing the prediction generated by the model with the worst residual error is relatively cumbersome to compute and gives mediocre results: usually better than a random choice but worse than the mean or median. I have repeated this trial for several different stores, and the results are consistently similar.
I also loop over the first ten test stores, with 9 predictions generated for each day, to confirm that the above generalisations hold on a somewhat bigger test set. The conclusion remains that the most accurate predictions are achieved by always selecting the median prediction.
sp <- combineRFForStore(1,9)
for (x in teststores[2:10]) {
  sp <- rbind(sp, combineRFForStore(x,9))
}
calculateErrors(sp)
## Mean HarmonicMean GeometricMean Median Min
## 6844.227165 6844.227165 6844.227165 6844.227165 6844.227165
## Max Random WorstResidual
## 6844.227165 6844.227165 6844.227165
Increasing the number of random forests that are combined is useful, but stops yielding improvements after a point. I loop over several calls of the combineRFForStore function (with store 262), gradually increasing the number of random forests, measuring the time, and collecting the median scores in a vector, which I finally plot.
medians <- rep(0,10)
for (i in seq(3,30,3)) {
  print(system.time(sp <- combineRFForStore(262,i)))
  e <- calculateErrors(sp)
  medians[i/3] <- e["Median"]
}
## user system elapsed
## 15.99 0.13 16.35
## user system elapsed
## 30.58 0.26 31.19
## user system elapsed
## 47.01 0.31 47.88
## user system elapsed
## 61.98 0.38 63.48
## user system elapsed
## 79.57 0.67 81.52
## user system elapsed
## 97.04 0.70 99.78
## user system elapsed
## 112.59 0.79 115.80
## user system elapsed
## 120.84 0.82 121.92
## user system elapsed
## 135.97 0.94 137.11
## user system elapsed
## 152.60 0.88 153.72
plot(seq(3,30,3),medians,xlab="Number of RF-based Predictions", ylab="Median of RMSEs")
The time needed to execute the function rises linearly, as expected: fitting the 10 random forests used in one pass of 10-fold cross-validation takes about 5 seconds, so fitting 30 such sets (300 forests) takes about two and a half minutes. The results are random, but generating only three predictions per day usually gives the highest error for the median prediction, six predictions are usually almost as good as the bigger sets, and there is no clear improvement between 9 and 30. Nine random forest passes therefore seem a good compromise between execution time (relatively fast) and quality (not substantially worse than anything else in the 9 to 30 range).
The code to combine SVM models is presented below:
combineSVMForStore <- function(storeNumber,n=5, c=30) {
  storeb <- tb[tb$Store == storeNumber,2:ncol(tb)]
  salesAndPredictions <- storeb[1]
  shuffledData <- storeb[sample(nrow(storeb)),]
  shuffledData$Prediction <- 0
  shuffledData$Residual <- 0
  shuffledData$Error <- 0
  z <- nrow(storeb)
  for (k in 1:n) {
    for (i in 1:10) {
      ind <- 1:z
      sampleIndex <- ind[floor(1+0.1*(i-1)*z):(0.1*i*z)]
      test <- shuffledData[sampleIndex,]
      train <- shuffledData[-sampleIndex,(1:(ncol(shuffledData)-1))]
      trank <- qr(train[3:ncol(train)])$rank
      pc <- prcomp(train[3:ncol(train)])
      pctrain <- cbind(train[1],pc$x)[1:min(trank,100)]
      pctest <- as.data.frame(predict(pc,test[3:ncol(test)-1]))[1:min(trank,100)]
      modell <- svm(Sales ~ ., pctrain, cost=c,gamma=0.001)
      shuffledData[sampleIndex,]$Prediction <- predict(modell,pctest)
      shuffledData[sampleIndex,]$Residual <- sqrt(mean((modell$residuals)^2))
      shuffledData[sampleIndex,]$Error <- abs(shuffledData[sampleIndex,]$Sales - shuffledData[sampleIndex,]$Prediction)
    }
    salesAndPredictions <- cbind(salesAndPredictions,shuffledData[order(as.numeric(row.names(shuffledData))),(ncol(shuffledData)-2) : ncol(shuffledData)])
    shuffledData <- shuffledData[sample(nrow(shuffledData)),]
    names(salesAndPredictions)[(ncol(salesAndPredictions)-2):ncol(salesAndPredictions)] <- c(paste0("P",k),paste0("R",k),paste0("E",k))
  }
  salesAndPredictions
}
This function is tested on a few stores, comparing the random forest results with the SVM results for the same stores.
for (i in c(1,128,251,262,562)) {
  print(paste("SVM for store",i))
  print(system.time(sp <- combineSVMForStore(i,9)))
  print(calculateErrors(sp))
  print(paste("RF for store",i))
  print(system.time(sp <- combineRFForStore(i,9)))
  print(calculateErrors(sp))
}
## [1] "SVM for store 1"
## user system elapsed
## 58.66 0.14 58.84
## Mean HarmonicMean GeometricMean Median Min
## 547.0483759 544.6061618 545.0969552 542.7071117 624.2674212
## Max Random WorstResidual
## 943.6167381 602.4919114 597.2074990
## [1] "RF for store 1"
## user system elapsed
## 32.09 0.24 32.37
## Mean HarmonicMean GeometricMean Median Min
## 4865.392356 4865.392356 4865.392356 4865.392356 4865.392356
## Max Random WorstResidual
## 4865.392356 4865.392356 4865.392356
## [1] "SVM for store 128"
## user system elapsed
## 57.55 0.06 57.68
## Mean HarmonicMean GeometricMean Median Min
## 985.7672645 946.1393057 958.7544588 943.1256068 1021.5017013
## Max Random WorstResidual
## 2890.6716138 1414.6923838 998.8589337
## [1] "RF for store 128"
## user system elapsed
## 31.86 0.28 32.16
## Mean HarmonicMean GeometricMean Median Min
## 7756.266553 7756.266553 7756.266553 7756.266553 7756.266553
## Max Random WorstResidual
## 7756.266553 7756.266553 7756.266553
## [1] "SVM for store 251"
## user system elapsed
## 56.81 0.09 56.98
## Mean HarmonicMean GeometricMean Median Min
## 1925.688837 1877.580218 1893.743699 1853.206990 2004.101233
## Max Random WorstResidual
## 5285.922911 2758.693444 1917.046215
## [1] "RF for store 251"
## user system elapsed
## 31.82 0.29 32.13
## Mean HarmonicMean GeometricMean Median Min
## 19448.94177 19448.94177 19448.94177 19448.94177 19448.94177
## Max Random WorstResidual
## 19448.94177 19448.94177 19448.94177
## [1] "SVM for store 262"
## user system elapsed
## 74.09 0.09 74.28
## Mean HarmonicMean GeometricMean Median Min
## 1891.580373 1858.773258 1871.792007 1833.505411 1892.246965
## Max Random WorstResidual
## 4296.687053 2294.992818 1865.602810
## [1] "RF for store 262"
## user system elapsed
## 49.56 0.36 50.01
## Mean HarmonicMean GeometricMean Median Min
## 21237.5182 21237.5182 21237.5182 21237.5182 21237.5182
## Max Random WorstResidual
## 21237.5182 21237.5182 21237.5182
## [1] "SVM for store 562"
## user system elapsed
## 78.99 0.13 79.37
## Mean HarmonicMean GeometricMean Median Min
## 1385.510099 1384.470882 1384.847637 1389.680890 1517.816718
## Max Random WorstResidual
## 1735.160049 1463.936827 1406.316113
## [1] "RF for store 562"
## user system elapsed
## 43.48 0.33 43.84
## Mean HarmonicMean GeometricMean Median Min
## 18206.92248 18206.92248 18206.92248 18206.92248 18206.92248
## Max Random WorstResidual
## 18206.92248 18206.92248 18206.92248
Surprisingly, although random forests consistently seemed to deliver superior results to support vector machines when we compared the RMSEs of single predictions, combining several SVM-based predictions drastically reduces the error and produces results that are much better than anything seen so far in many shops (1, 128, 251), whereas the reverse is true in others (262, 562). I have run these trials several times, and the behaviour is consistent within each shop. It therefore appears sensible either to determine per shop which type of model works better and generate predictions with that model, or to combine the SVM and RF predictions.
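A per-shop choice could be operationalised with a small helper along the following lines (a sketch only; chooseModelForStore is hypothetical and not used in the remainder of this paper).
chooseModelForStore <- function(storeNumber, n = 9) {
  # compare the median-combined cross-validation errors of the two approaches
  svmErr <- calculateErrors(combineSVMForStore(storeNumber, n))["Median"]
  rfErr <- calculateErrors(combineRFForStore(storeNumber, n))["Median"]
  if (svmErr < rfErr) "SVM" else "RF"
}
chooseModelForStore(262)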
Unfortunately, SVMs are also slower to fit, taking roughly twice as long. Still, the improvement in accuracy seen in some shops more than makes up for this.
medians <- rep(0,5)
for (i in seq(3,15,3)) {
  print(system.time(sp <- combineSVMForStore(262,i)))
  e <- calculateErrors(sp)
  medians[i/3] <- e["Median"]
}
## user system elapsed
## 24.79 0.06 24.90
## user system elapsed
## 49.50 0.05 49.61
## user system elapsed
## 73.72 0.09 73.98
## user system elapsed
## 98.47 0.14 98.81
## user system elapsed
## 123.13 0.23 123.46
plot(seq(3,15,3),medians,xlab="Number of SVM-based Predictions", ylab="Median of RMSEs")
Increasing the number of SVM-based predictions that are combined is only useful up to around 6; beyond that point there is no further improvement.
Next I experiment with generating, for each row, 9 SVM-based predictions with C = 30, 100 and 1000 respectively and 9 RF-based predictions, and with combining these in various ways.
best <- rep(0,teststores[15])
spsvm <- combineSVMForStore(1,9,30)
print(e1 <- calculateErrors(spsvm))
## Mean HarmonicMean GeometricMean Median Min
## 556.2173335 548.4557681 551.2186891 544.0209649 622.8145605
## Max Random WorstResidual
## 1190.4424576 649.2840112 569.7104359
spsvm <- cbind(rep(1,nrow(spsvm)),spsvm)
names(spsvm)[1] <- "Store"
spsvm2 <- combineSVMForStore(1,9,100)
print(e2 <- calculateErrors(spsvm2))
## Mean HarmonicMean GeometricMean Median Min
## 508.1176035 495.2274721 499.7816944 489.2613421 570.5219942
## Max Random WorstResidual
## 1238.6375621 615.6683465 578.7358995
spsvm3 <- combineSVMForStore(1,9,1000)
print(e3 <- calculateErrors(spsvm3))
## Mean HarmonicMean GeometricMean Median Min
## 481.8217780 480.5393366 480.7184976 478.5506778 601.6078272
## Max Random WorstResidual
## 751.1668609 519.7521619 502.6546366
sprf <- combineRFForStore(1,9)
print(e4 <- calculateErrors(sprf))
## Mean HarmonicMean GeometricMean Median Min
## 4865.392356 4865.392356 4865.392356 4865.392356 4865.392356
## Max Random WorstResidual
## 4865.392356 4865.392356 4865.392356
es <- c(e1[4],e2[4],e3[4],e4[4])
if (e1[4] == min(es)) {
  best[1] <- 1
} else if (e2[4] == min(es)) {
  best[1] <- 2
} else if (e3[4] == min(es)) {
  best[1] <- 3
} else {
  best[1] <- 4
}
for (i in teststores[2:15]) {
  print(paste("SVM(1) for store",i))
  sp <- combineSVMForStore(i,9,30)
  print(e1 <- calculateErrors(sp))
  sp <- cbind(rep(i,nrow(sp)),sp)
  names(sp)[1] <- "Store"
  spsvm <- rbind(spsvm,sp)
  print(paste("SVM(2) for store",i))
  sp <- combineSVMForStore(i,9,100)
  print(e2 <- calculateErrors(sp))
  spsvm2 <- rbind(spsvm2,sp)
  print(paste("SVM(3) for store",i))
  sp <- combineSVMForStore(i,9,1000)
  print(e3 <- calculateErrors(sp))
  spsvm3 <- rbind(spsvm3,sp)
  print(paste("RF for store",i))
  sp <- combineRFForStore(i,9)
  print(e4 <- calculateErrors(sp))
  sprf <- rbind(sprf,sp)
  es <- c(e1[4],e2[4],e3[4],e4[4])
  if (e1[4] == min(es)) {
    best[i] <- 1
  } else if (e2[4] == min(es)) {
    best[i] <- 2
  } else if (e3[4] == min(es)) {
    best[i] <- 3
  } else {
    best[i] <- 4
  }
}
## [1] "SVM(1) for store 3"
## Mean HarmonicMean GeometricMean Median Min
## 886.6921131 878.2345570 880.8309748 875.2120908 972.4512192
## Max Random WorstResidual
## 1582.5811268 975.5574794 933.3558481
## [1] "SVM(2) for store 3"
## Mean HarmonicMean GeometricMean Median Min
## 842.2599582 843.6035004 840.3440793 827.8073063 1076.6447985
## Max Random WorstResidual
## 1577.7096684 1033.9610190 910.8981422
## [1] "SVM(3) for store 3"
## Mean HarmonicMean GeometricMean Median Min
## 831.6306193 880.1541002 846.6431962 821.2512171 1580.8447852
## Max Random WorstResidual
## 984.9484978 963.5207627 895.3741700
## [1] "RF for store 3"
## Mean HarmonicMean GeometricMean Median Min
## 7279.432151 7279.432151 7279.432151 7279.432151 7279.432151
## Max Random WorstResidual
## 7279.432151 7279.432151 7279.432151
## [1] "SVM(1) for store 7"
## Mean HarmonicMean GeometricMean Median Min
## 1071.240280 1079.386513 1074.493088 1063.146507 1452.464972
## Max Random WorstResidual
## 1119.760668 1092.655302 1083.681666
## [1] "SVM(2) for store 7"
## Mean HarmonicMean GeometricMean Median Min
## 1009.9360682 1014.4992647 1011.3200211 996.5205004 1229.0971854
## Max Random WorstResidual
## 1541.5369399 1139.6996573 1048.2345652
## [1] "SVM(3) for store 7"
## Mean HarmonicMean GeometricMean Median Min
## 1091.744389 156922.194464 NaN 1002.387318 3545.710053
## Max Random WorstResidual
## 1307.699712 1436.881639 1093.115542
## [1] "RF for store 7"
## Mean HarmonicMean GeometricMean Median Min
## 9150.327463 9150.327463 9150.327463 9150.327463 9150.327463
## Max Random WorstResidual
## 9150.327463 9150.327463 9150.327463
## [1] "SVM(1) for store 8"
## Mean HarmonicMean GeometricMean Median Min
## 587.7418121 587.3411995 587.5004205 587.9790325 663.3869310
## Max Random WorstResidual
## 697.0941580 620.1082543 625.1968176
## [1] "SVM(2) for store 8"
## Mean HarmonicMean GeometricMean Median Min
## 579.9801216 574.0231715 575.9229349 574.8261956 667.9956183
## Max Random WorstResidual
## 1120.9254396 684.7632993 632.6027797
## [1] "SVM(3) for store 8"
## Mean HarmonicMean GeometricMean Median Min
## 571.6371083 572.4031224 571.3105374 561.4358880 786.3052653
## Max Random WorstResidual
## 804.2303737 628.5468113 607.8182412
## [1] "RF for store 8"
## Mean HarmonicMean GeometricMean Median Min
## 5853.845733 5853.845733 5853.845733 5853.845733 5853.845733
## Max Random WorstResidual
## 5853.845733 5853.845733 5853.845733
## [1] "SVM(1) for store 9"
## Mean HarmonicMean GeometricMean Median Min
## 803.9919502 760.3072135 773.3883674 755.0415969 797.6615622
## Max Random WorstResidual
## 2709.9877874 1187.7048660 802.6902016
## [1] "SVM(2) for store 9"
## Mean HarmonicMean GeometricMean Median Min
## 760.2129518 731.6930501 739.1318359 714.4913989 836.3047866
## Max Random WorstResidual
## 2522.0609758 909.2143319 803.5124436
## [1] "SVM(3) for store 9"
## Mean HarmonicMean GeometricMean Median Min
## 816.2601570 731.5231506 755.4216404 711.8659913 819.1056757
## Max Random WorstResidual
## 3479.5095344 1317.7108174 788.8333216
## [1] "RF for store 9"
## Mean HarmonicMean GeometricMean Median Min
## 6761.167323 6761.167323 6761.167323 6761.167323 6761.167323
## Max Random WorstResidual
## 6761.167323 6761.167323 6761.167323
## [1] "SVM(1) for store 10"
## Mean HarmonicMean GeometricMean Median Min
## 541.1113955 537.5541260 538.8075977 537.3477384 597.2329503
## Max Random WorstResidual
## 1005.8683356 588.8779090 567.2506384
## [1] "SVM(2) for store 10"
## Mean HarmonicMean GeometricMean Median Min
## 522.2504587 517.5846677 519.3817972 519.2821079 608.0763803
## Max Random WorstResidual
## 1033.1234859 633.2199619 582.5228987
## [1] "SVM(3) for store 10"
## Mean HarmonicMean GeometricMean Median Min
## 516.7513337 512.9365700 514.0687868 511.4762694 638.4452964
## Max Random WorstResidual
## 1071.7643012 631.6177377 548.4928346
## [1] "RF for store 10"
## Mean HarmonicMean GeometricMean Median Min
## 5666.637658 5666.637658 5666.637658 5666.637658 5666.637658
## Max Random WorstResidual
## 5666.637658 5666.637658 5666.637658
## [1] "SVM(1) for store 11"
## Mean HarmonicMean GeometricMean Median Min
## 1137.486314 1115.542678 1123.541125 1105.974839 1203.042213
## Max Random WorstResidual
## 2451.065440 1235.981269 1425.388899
## [1] "SVM(2) for store 11"
## Mean HarmonicMean GeometricMean Median Min
## 1081.886595 1050.466724 1059.875718 1051.957339 1236.542612
## Max Random WorstResidual
## 2783.116091 1397.780945 1050.963688
## [1] "SVM(3) for store 11"
## Mean HarmonicMean GeometricMean Median Min
## 1008.180491 2675.181708 NaN 1002.136206 1466.293025
## Max Random WorstResidual
## 1874.013761 1179.120886 1147.447585
## [1] "RF for store 11"
## Mean HarmonicMean GeometricMean Median Min
## 8336.848314 8336.848314 8336.848314 8336.848314 8336.848314
## Max Random WorstResidual
## 8336.848314 8336.848314 8336.848314
## [1] "SVM(1) for store 12"
## Mean HarmonicMean GeometricMean Median Min
## 904.0394359 876.9985210 886.1881400 859.4536710 924.3889094
## Max Random WorstResidual
## 2424.3784360 1249.8088251 921.4300465
## [1] "SVM(2) for store 12"
## Mean HarmonicMean GeometricMean Median Min
## 878.8461784 841.4492324 852.5346182 835.9096094 943.2381004
## Max Random WorstResidual
## 2955.1499747 1430.8224306 950.5872758
## [1] "SVM(3) for store 12"
## Mean HarmonicMean GeometricMean Median Min
## 907.2567421 845.3606695 861.1895857 842.6188198 1005.5352834
## Max Random WorstResidual
## 3541.4955625 1283.4030277 1046.5859341
## [1] "RF for store 12"
## Mean HarmonicMean GeometricMean Median Min
## 7861.834065 7861.834065 7861.834065 7861.834065 7861.834065
## Max Random WorstResidual
## 7861.834065 7861.834065 7861.834065
## [1] "SVM(1) for store 13"
## Mean HarmonicMean GeometricMean Median Min
## 721.7162910 724.2015922 722.6623421 725.4628937 840.2247204
## Max Random WorstResidual
## 929.2918098 761.5470674 809.4916836
## [1] "SVM(2) for store 13"
## Mean HarmonicMean GeometricMean Median Min
## 721.0671970 720.3311338 719.9750514 720.7444773 847.6468035
## Max Random WorstResidual
## 1171.4703679 838.0380973 810.7961776
## [1] "SVM(3) for store 13"
## Mean HarmonicMean GeometricMean Median Min
## 732.0027586 739.2274107 734.5988802 734.0670337 1084.3334400
## Max Random WorstResidual
## 901.5887130 797.6257060 779.4568822
## [1] "RF for store 13"
## Mean HarmonicMean GeometricMean Median Min
## 5286.844562 5286.844562 5286.844562 5286.844562 5286.844562
## Max Random WorstResidual
## 5286.844562 5286.844562 5286.844562
## [1] "SVM(1) for store 14"
## Mean HarmonicMean GeometricMean Median Min
## 603.8128637 594.9630309 598.2647708 594.5538619 663.9923708
## Max Random WorstResidual
## 1204.6962953 691.4602435 622.3655136
## [1] "SVM(2) for store 14"
## Mean HarmonicMean GeometricMean Median Min
## 570.2212718 570.6726597 569.7971573 565.9134143 699.8272429
## Max Random WorstResidual
## 832.8007677 671.9448978 641.7335939
## [1] "SVM(3) for store 14"
## Mean HarmonicMean GeometricMean Median Min
## 568.9376912 570.5371586 569.1897497 561.0756271 727.5160909
## Max Random WorstResidual
## 840.5882147 657.2871851 757.3678156
## [1] "RF for store 14"
## Mean HarmonicMean GeometricMean Median Min
## 5702.6173 5702.6173 5702.6173 5702.6173 5702.6173
## Max Random WorstResidual
## 5702.6173 5702.6173 5702.6173
## [1] "SVM(1) for store 15"
## Mean HarmonicMean GeometricMean Median Min
## 823.1962536 790.6439707 800.8401862 782.8023396 866.2328633
## Max Random WorstResidual
## 2425.9092020 1105.6531487 832.6074777
## [1] "SVM(2) for store 15"
## Mean HarmonicMean GeometricMean Median Min
## 751.6079078 746.0290621 748.1944698 735.4663313 892.8401471
## Max Random WorstResidual
## 1288.0163486 856.0561395 851.7882099
## [1] "SVM(3) for store 15"
## Mean HarmonicMean GeometricMean Median Min
## 762.4185249 745.8740980 750.4112548 741.9594954 938.8330296
## Max Random WorstResidual
## 1941.3534453 1064.2254154 813.3035391
## [1] "RF for store 15"
## Mean HarmonicMean GeometricMean Median Min
## 6866.798896 6866.798896 6866.798896 6866.798896 6866.798896
## Max Random WorstResidual
## 6866.798896 6866.798896 6866.798896
## [1] "SVM(1) for store 16"
## Mean HarmonicMean GeometricMean Median Min
## 736.0852472 719.1100320 725.1983297 702.2166189 776.8328010
## Max Random WorstResidual
## 1859.1902332 931.3263433 751.9494886
## [1] "SVM(2) for store 16"
## Mean HarmonicMean GeometricMean Median Min
## 687.8638801 671.5744547 677.3507078 663.0925373 772.7677931
## Max Random WorstResidual
## 1835.3372596 845.8755141 765.3886367
## [1] "SVM(3) for store 16"
## Mean HarmonicMean GeometricMean Median Min
## 694.7667917 670.1098699 677.6090695 656.4446567 804.4070440
## Max Random WorstResidual
## 2157.7477967 969.2154324 1709.1294958
## [1] "RF for store 16"
## Mean HarmonicMean GeometricMean Median Min
## 7912.951995 7912.951995 7912.951995 7912.951995 7912.951995
## Max Random WorstResidual
## 7912.951995 7912.951995 7912.951995
## [1] "SVM(1) for store 19"
## Mean HarmonicMean GeometricMean Median Min
## 724.1964028 713.1630456 717.5309387 708.3380552 771.1089626
## Max Random WorstResidual
## 1367.8701759 784.3305282 752.8509324
## [1] "SVM(2) for store 19"
## Mean HarmonicMean GeometricMean Median Min
## 703.1996247 681.3446891 687.6399956 668.9656044 803.1113181
## Max Random WorstResidual
## 2116.3241258 960.1600227 715.1356376
## [1] "SVM(3) for store 19"
## Mean HarmonicMean GeometricMean Median Min
## 706.1614159 675.0823639 684.2140466 664.9845195 774.1792958
## Max Random WorstResidual
## 2269.5609221 980.4870871 729.1072629
## [1] "RF for store 19"
## Mean HarmonicMean GeometricMean Median Min
## 6720.942768 6720.942768 6720.942768 6720.942768 6720.942768
## Max Random WorstResidual
## 6720.942768 6720.942768 6720.942768
## [1] "SVM(1) for store 20"
## Mean HarmonicMean GeometricMean Median Min
## 754.2735776 753.4836821 753.5391313 758.8901111 883.5699669
## Max Random WorstResidual
## 1050.6965095 843.5227834 814.7508368
## [1] "SVM(2) for store 20"
## Mean HarmonicMean GeometricMean Median Min
## 754.6761114 753.3091153 753.8403900 765.6914858 902.8926035
## Max Random WorstResidual
## 1076.3122725 809.4277620 830.9759715
## [1] "SVM(3) for store 20"
## Mean HarmonicMean GeometricMean Median Min
## 763.6950493 766.7143385 764.7837020 767.5652662 1008.1508108
## Max Random WorstResidual
## 1052.1450029 833.1155438 903.2547734
## [1] "RF for store 20"
## Mean HarmonicMean GeometricMean Median Min
## 7939.473525 7939.473525 7939.473525 7939.473525 7939.473525
## Max Random WorstResidual
## 7939.473525 7939.473525 7939.473525
## [1] "SVM(1) for store 21"
## Mean HarmonicMean GeometricMean Median Min
## 739.4236181 735.1079984 736.3933183 731.2712034 842.6407425
## Max Random WorstResidual
## 1192.9262575 829.2542327 777.2229044
## [1] "SVM(2) for store 21"
## Mean HarmonicMean GeometricMean Median Min
## 725.3785723 722.8530738 723.6889388 718.2075379 832.5824423
## Max Random WorstResidual
## 1100.4462701 820.3624635 795.4384827
## [1] "SVM(3) for store 21"
## Mean HarmonicMean GeometricMean Median Min
## 719.7755747 723.5715856 720.8928068 721.8340240 1017.1181317
## Max Random WorstResidual
## 904.7028094 763.4841907 766.6314524
## [1] "RF for store 21"
## Mean HarmonicMean GeometricMean Median Min
## 5766.073943 5766.073943 5766.073943 5766.073943 5766.073943
## Max Random WorstResidual
## 5766.073943 5766.073943 5766.073943
spcombined <- spsvm[2]                                         # the Sales column
spcombined <- cbind(spcombined,spsvm[seq(3,ncol(spsvm),3)])    # 9 predictions from the SVM with C=30
spcombined <- cbind(spcombined,spsvm2[seq(2,ncol(spsvm2),3)])  # 9 predictions from the SVM with C=100
spcombined <- cbind(spcombined,spsvm3[seq(2,ncol(spsvm3),3)])  # 9 predictions from the SVM with C=1000
spcombined <- cbind(spcombined,sprf[seq(2,ncol(sprf),3)])      # 9 predictions from the random forests
spcombined$Mean <- apply(spcombined[,2:37],1,mean)             # columns 2:37 hold all 36 predictions
spcombined$HarmonicMean <- apply(spcombined[,2:37],1,function(x) {1/mean(1/x)})
spcombined$Median <- apply(spcombined[,2:37],1,median)
spcombined$SVM1Median <- apply(spcombined[,2:10],1,median)
spcombined$SVM2Median <- apply(spcombined[,11:19],1,median)
spcombined$SVM3Median <- apply(spcombined[,20:28],1,median)
spcombined$RFMedian <- apply(spcombined[,29:37],1,median)
spcombined$Median2_4 <- apply(spcombined[,11:37],1,median)     # SVM C=100, C=1000 and RF predictions together
spcombined$Mean2_4 <- apply(spcombined[,11:37],1,mean)
spcombined$Mean3_4 <- apply(spcombined[,20:37],1,mean)         # SVM C=1000 and RF predictions together
spcombined$Store <- spsvm[1]
spcombined$BestIndex <- apply(spcombined[,48],1,function(x){best[x[1]]})              # the best model per store, as determined above
spcombined$BestModelMedian <- apply(spcombined[,c(41:44,49)],1,function(x){x[x[5]]})  # the median prediction of that best model
errors <- rep(0,11)
errors[1] <- sqrt(mean((spcombined$Sales-spcombined$Mean)^2))
errors[2] <- sqrt(mean((spcombined$Sales-spcombined$HarmonicMean)^2))
errors[3] <- sqrt(mean((spcombined$Sales-spcombined$Median)^2))
errors[4] <- sqrt(mean((spcombined$Sales-spcombined$SVM1Median)^2))
errors[5] <- sqrt(mean((spcombined$Sales-spcombined$SVM2Median)^2))
errors[6] <- sqrt(mean((spcombined$Sales-spcombined$SVM3Median)^2))
errors[7] <- sqrt(mean((spcombined$Sales-spcombined$RFMedian)^2))
errors[8] <- sqrt(mean((spcombined$Sales-spcombined$Median2_4)^2))
errors[9] <- sqrt(mean((spcombined$Sales-spcombined$Mean2_4)^2))
errors[10] <- sqrt(mean((spcombined$Sales-spcombined$Mean3_4)^2))
errors[11] <- sqrt(mean((spcombined$Sales-spcombined$BestModelMedian)^2))
names(errors) = c("Mean","HarmonicMean","Median","SVM1Median","SVM2Median","SVM3Median","RFMedian","Median2_4","Mean2_4","Mean3_4","BestModelMedian")
errors
## Mean HarmonicMean Median SVM1Median
## 1849.4698715 6910.6134247 754.5727477 773.8190255
## SVM2Median SVM3Median RFMedian Median2_4
## 739.6420578 734.3044560 6912.1748825 756.6036705
## Mean2_4 Mean3_4 BestModelMedian
## 2380.5779594 3502.0374637 731.3195810
The results for the first 15 stores are quite clear: SVMs work better than random forests for all shops. If we take the median of the predictions generated by support vector machines using any settings, we get a more accurate prediction overall than by taking the median of random-forest-based predictions. The improvement is rather substantial.
What is surprising, however, is that despite the sometimes far worse accuracy of the random forest predictions, taking any kind of mean of all 36 predictions generated with the four methods (i.e. the “Mean”, “HarmonicMean” and “Median” figures) actually yields better results than the most accurate SVM-based predictions (mostly the SVM3Median value, i.e. the error of the median prediction generated by SVMs at C=1000). In fact, for the first 15 shops we find that combining the predictions generated by the C=100 and C=1000 SVMs together with the random forests (the “Median2_4” and “Mean2_4” results) yields the lowest cross-validation error.
Finally, it is worth noting that always choosing the median of the predictions generated by the model with the lowest RMSE, i.e. the most accurate type of model, is still slightly less accurate than choosing the mean or median of all predictions.
The logical next step is to repeat the same process on the whole data set (all stores) to see whether these findings can be confirmed. Generating this many predictions for the training set takes rather long (over 10 hours for both the SVMs and the random forests), so I did not run the code inside the R Markdown document but separately, using the lines below. The results confirmed the above findings.
sp <- combineRFForStore(1,9)
for (x in teststores[-1]) {
  sp <- rbind(sp, combineRFForStore(x,9))
}
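The analogous SVM passes (also run outside this document) follow the same pattern; the lines below are a sketch for one cost setting, using a hypothetical spsvmAll accumulator.
spsvmAll <- combineSVMForStore(1,9,100)
for (x in teststores[-1]) {
  spsvmAll <- rbind(spsvmAll, combineSVMForStore(x,9,100))
}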
I conclude from the outcomes of these experiments on subsets of the whole training data that it is consistently very useful to generate predictions using several different models (SVMs using different parameters, with the data transformed by PCA, as well as random forests) and combine them by taking the mean or median of all predictions. This led to results that were clearly superior to generating single predictions using any of the methods, and even to generating several predictions, all using the same method.
I generated predictions for the test data in the Kaggle competition based on what I have learned about the data above.
First I add the additional variables, as I did for the training table.
ttest$DateYear <- as.numeric(strftime(ttest$Date, format="%y"))
ttest$DateMonth <- as.numeric(strftime(ttest$Date, format="%m"))
ttest$DateDay <- as.numeric(strftime(ttest$Date, format="%d"))
ttest$DateWeek <- as.numeric(strftime(ttest$Date, format="%W"))
ttest <- ttest[c(1,4:11,2)]
ttest$Competition <- F
ttest$Promo2 <- F
for (i in teststores) {
  if (!is.na(s[i,5])) {
    m <- s[i,5]
    y <- s[i,6] - 2000
    ttest[ttest$Store == i,]$Competition <- ttest[ttest$Store == i,]$DateYear > y | (ttest[ttest$Store == i,]$DateMonth >= m & ttest[ttest$Store == i,]$DateYear == y)
  }
  if (s[i,7]) {
    w <- s[i,8]
    y <- s[i,9] - 2000
    ttest[ttest$Store == i,]$Promo2 <- ttest[ttest$Store == i,]$DateYear > y | (ttest[ttest$Store == i,]$DateWeek >= w & ttest[ttest$Store == i,]$DateYear == y)
  }
}
This table will be used for the random forest predictions. I also create its binary version for the SVM predictions. The difficulty here is that this cannot be accomplished simply by transforming the numeric variables present in the test set into binary variables, since for some variables fewer values are present in the test table than in the training table. For example, the Year variable had three possible values in the training table (13, 14, 15) but only one value (15) in the test table. Automatic transformation would therefore produce two variables in the training table (14TRUE, 15TRUE) but no variable at all in the test table (since 15 is constant there). This is a problem because the models are fitted to the training data, and predictions with these models only work if the very same variables are present in the test data as in the training data.
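A minimal illustration of the problem with hypothetical year values (not part of the actual pipeline):
trainYear <- data.frame(DateYear = factor(c(13, 14, 15)))
testYear <- data.frame(DateYear = factor(15))
colnames(model.matrix(~ DateYear, trainYear))                   # "(Intercept)" "DateYear14" "DateYear15"
# model.matrix(~ DateYear, testYear)                            # fails: contrasts need at least two levels
colnames(model.matrix(~ DateYear, rbind(trainYear, testYear)))  # same columns as the training table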
The solution is to combine the rows of the training and the test table before carrying out the binary transformation, and then extract the rows with the test data. This is achieved with the following code.
ttemp1 <- t[3:12]
ttemp2 <- ttest[3:12]
lastTrainRow <- nrow(ttemp1)
ttemp <- rbind(ttemp1,ttemp2)
ttemp <- cbind(rep(1,nrow(ttemp)),ttemp)
names(ttemp)[1] <- "Dummy"
tbtemp <- model.matrix(Dummy ~ Promo + StateHoliday + SchoolHoliday + DayOfWeek + as.factor(DateYear) + as.factor(DateMonth) + as.factor(DateDay) + as.factor(DateWeek) + Competition + Promo2, data = ttemp)
tbtest <- cbind(ttest[1:2],tbtemp[(lastTrainRow+1):nrow(tbtemp),2:ncol(tbtemp)])
I also transform the data type for the factor and logical variables in the numeric tables to numeric.
tn <- t
tn$Promo <- as.numeric(tn$Promo)
tn$StateHoliday <- as.numeric(tn$StateHoliday)
tn$SchoolHoliday <- as.numeric(tn$SchoolHoliday)
tn$DayOfWeek <- as.numeric(tn$DayOfWeek)
tn$Competition<- as.numeric(tn$Competition)
tn$Promo2 <- as.numeric(tn$Promo2)
tntest <- ttest
tntest$Promo <- as.numeric(tntest$Promo)
tntest$StateHoliday <- as.numeric(tntest$StateHoliday)
tntest$SchoolHoliday <- as.numeric(tntest$SchoolHoliday)
tntest$DayOfWeek <- as.numeric(tntest$DayOfWeek)
tntest$Competition<- as.numeric(tntest$Competition)
tntest$Promo2 <- as.numeric(tntest$Promo2)
Before generating the actual predictions, I create a new Prediction variable in the ttest table, and set it to 0 everywhere.
ttest$Prediction <- 0
A further complication is that the procedure which produced the best predictions above cannot be carried over directly to models fitted on the whole data set. Recall that the best results were achieved when at least 6 different SVM-based predictions were generated for each open day of each shop and the median of these predictions was taken. This is easy to reproduce with random forests by fitting the required number of forests to the whole data set, but it does not work with SVMs, whose implementation is completely deterministic: if SVMs are fitted to the same training set and predictions are generated for the same test set several times, the very same predictions are produced every time.
One option would be to generate a single SVM prediction, but I chose a different route and proceed as during cross-validation: I divide the whole training set into ten equal parts and train 10 separate SVMs, each on a different 90% subset of the data. This yields ten separate predictions, each based on a slightly different data set. I repeat this procedure for C = 100 and C = 1000.
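The idea for a single cost value can be sketched as a small helper (hypothetical; the full loop below inlines the same logic for both cost values):
tenSvmPredictions <- function(pctrain, pctest, cst) {
  z <- nrow(pctrain)
  shuffledIndices <- sample(z)
  # one SVM per 90% subset, each contributing one column of predictions for the test rows
  sapply(1:10, function(k) {
    sampleIndex <- floor(1+0.1*(k-1)*z):(0.1*k*z)
    modell <- svm(Sales ~ ., pctrain[shuffledIndices[-sampleIndex],], cost=cst, gamma=0.001)
    as.vector(predict(modell, pctest))
  })
}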
To summarize, the following procedure is followed for every store: generate 9 random forest predictions on the numeric data, then 10 SVM predictions at C = 100 and 10 at C = 1000 on the PCA-transformed binary data (each SVM trained on a different 90% subset), and finally take the median of these 29 predictions for every open day.
print(system.time(for (storeNumber in teststores) {
  print(storeNumber)
  storen <- tn[tn$Store == storeNumber,2:ncol(tn)]
  storentest <- tntest[tntest$Store == storeNumber & tntest$Open,3:ncol(tntest)]
  storeb <- tb[tb$Store == storeNumber,2:ncol(tb)]
  storeb <- storeb[sample(nrow(storeb)),]
  storebtest <- tbtest[tbtest$Store == storeNumber & tbtest$Open,3:ncol(tbtest)]
  trank <- qr(storeb[3:ncol(storeb)])$rank
  pc <- prcomp(storeb[3:ncol(storeb)])
  pctrain <- cbind(storeb[1],pc$x)[1:min(trank-1,100)]
  pctest <- as.data.frame(predict(pc,storebtest))[1:min(trank,100)]
  predictions <- matrix(rep(0, nrow(storentest) * 29), nrow = nrow(storentest), ncol = 29)
  # 9 random forest predictions based on the numeric data
  for (c in 1:9) {
    modell <- randomForest(Sales ~ .,storen, ntree=200)
    predictions[,c] <- as.vector(predict(modell,storentest))
  }
  # 10 SVM predictions with C = 100, each trained on a different 90% subset of the PCA-transformed data
  z <- nrow(storen)
  shuffledIndices <- sample(z)
  for (c in 1:10) {
    sampleIndex <- floor(1+0.1*(c-1)*z):(0.1*c*z)
    trainParts <- pctrain[shuffledIndices[-sampleIndex],]
    modell <- svm(Sales ~ ., trainParts, cost=100,gamma=0.001)
    predictions[,c+9] <- as.vector(predict(modell,pctest))
  }
  # 10 SVM predictions with C = 1000
  shuffledIndices <- sample(z)
  for (c in 1:10) {
    sampleIndex <- floor(1+0.1*(c-1)*z):(0.1*c*z)
    trainParts <- pctrain[shuffledIndices[-sampleIndex],]
    modell <- svm(Sales ~ ., trainParts, cost=1000,gamma=0.001)
    predictions[,c+19] <- as.vector(predict(modell,pctest))
  }
  # the final prediction for each open day is the median of all 29 predictions
  ttest[ttest$Store == storeNumber & ttest$Open,]$Prediction <- apply(predictions,1,median)
}))
Finally the prediction is saved in a simple two-column CSV file as required.
submission <- cbind(1:nrow(ttest),ttest$Prediction)
write.csv(submission,file="submission.csv",row.names=FALSE) # write.csv always uses "," as the separator, so no sep argument is needed
The result was submitted to the Kaggle competition system for evaluation. Unfortunately the true accuracy of the predictions turned out to be far from satisfactory: the code above only came in 2350th out of 3300 entries. At a root mean squared percentage error of about 0.154 it was certainly better than the baseline set by Rossmann (the median of the sales for the given day of week and shop, with an RMSPE of about 0.24), but still very far from the winner's score of just 0.1. Surprisingly, another submission of mine that used a less sophisticated version of the above procedure, working with just one SVM at C=100, one at C=1000 and the median of 9 random forest predictions, and taking the mean of these three figures, fared noticeably better and reached position 2170 with a score of about 0.147. That is quite a bit better, but still clearly not worth the amount of time and energy invested, and it is not clear to me how the procedure could be improved further.
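For reference, the competition metric (root mean squared percentage error, with zero-sales days ignored) can be computed locally with a small helper such as the sketch below, which may be useful for evaluating future variants against held-out data.
rmspe <- function(actual, predicted) {
  keep <- actual != 0   # days with zero sales are ignored in scoring
  sqrt(mean(((actual[keep] - predicted[keep]) / actual[keep])^2))
}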