The main goal of this project is to use various contextual data to predict sales at a number of Rossmann stores. Rossmann is Germany’s second-largest drug store chain, with around 3,000 stores across Europe.

Let’s load the data and see what we have available.

library(pacman)
p_load(readr, dplyr, ggplot2, ggthemr, randomForest)
df <- read_csv("data/train.csv")
names(df)
## [1] "Store"         "DayOfWeek"     "Date"          "Sales"        
## [5] "Customers"     "Open"          "Promo"         "StateHoliday" 
## [9] "SchoolHoliday"

All of those columns look like they could be of use in building a predictive model. The most obvious way to predict sales would be to use customer data. Unfortunately, we can’t really have “future” customer data, so we won’t be using it to design our model. Still, we can check whether there is a real linear relationship between the Sales and Customers variables. First, let’s make a plot and have a look.

ggthemr('fresh')

plot_1 <- df %>% sample_frac(0.1) %>% 
            select(Sales, Customers) %>% 
            ggplot(aes(Sales, Customers)) + geom_point(alpha = 0.3) +
            ggtitle("Sales ~ Customers")

plot_2 <- df %>% sample_frac(0.1) %>%
            ggplot(aes(Sales)) + geom_histogram(binwidth = 500) +
            ggtitle("Distribution of Sales")



# multiplot() is a helper from the "Cookbook for R";
# gridExtra::grid.arrange(plot_1, plot_2, ncol = 2) achieves the same layout
multiplot(plot_1, plot_2, cols = 2)

Notice that we plot only a 10% sample of the dataset to save computation time, and that we add some transparency to reduce the effects of overplotting. There is a clear positive linear relationship between the two variables, so we can try to fit a linear regression model.

From the histogram we can see that Sales is roughly normally distributed, with quite a few values at 0. This is most probably because the stores were closed on those days.
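We can quickly check this assumption against the Open flag. The following snippet (a small sanity check, not part of the original pipeline) counts the zero-sales rows by whether the store was open:

# count zero-sales days by the Open flag; if the assumption holds,
# nearly all of them should have Open == 0
df %>%
    filter(Sales == 0) %>%
    count(Open)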

# avoid calling the model `lm`, which would shadow the base function
sales_lm <- df %>%
        select(Sales, Customers) %>% 
        lm()

summary(sales_lm)
## 
## Call:
## lm(formula = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28685.0  -1077.7   -253.6    882.4  27735.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.078e+03  2.883e+00   373.9   <2e-16 ***
## Customers   7.417e+00  3.671e-03  2020.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1720 on 1017207 degrees of freedom
## Multiple R-squared:  0.8005, Adjusted R-squared:  0.8005 
## F-statistic: 4.082e+06 on 1 and 1017207 DF,  p-value: < 2.2e-16

An R² of 0.80 indicates a reasonably strong linear relationship.
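For a simple regression like this, R² is just the squared Pearson correlation between the two variables, which we can verify directly:

# R² of a one-predictor linear model equals the squared correlation
cor(df$Sales, df$Customers)^2

The base plot() method also gives us some diagnostic plots with more graphical information about our model.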

plot(sales_lm)

A nice model, but as mentioned before we can’t use the Customers data to accomplish the aim of our project. A very good and popular algorithm that can be used for both regression and classification problems is the Random Forest. In a nutshell, it works by building many decision trees (the number of trees is a parameter we can specify) and averaging the predictions of the individual trees. This is a very powerful method and is widely used because it mitigates the overfitting issues that are common for individual decision trees.
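As a minimal illustration (not part of the modelling pipeline itself), here is how that parameter is set with the randomForest package, via the ntree argument (the default is 500); the sample size and predictor choice here are arbitrary:

# illustrative fit on a small sample, numeric columns only;
# ntree controls how many trees the forest grows
rf_demo <- randomForest(Sales ~ DayOfWeek + Promo + Open,
                        data = sample_n(df, 10000), ntree = 100)
rf_demo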

So let’s get started by loading all the data.

# we use base read.csv() here so that StateHoliday comes in as a factor
# (in R >= 4.0 this requires stringsAsFactors = TRUE explicitly),
# which the levels() alignment in the loop below relies on
train_df <- read.csv("data/train.csv", stringsAsFactors = TRUE)
test_df <- read.csv("data/test.csv", stringsAsFactors = TRUE)
sample_submission <- read.csv("data/sample_submission.csv")

We also have to perform some necessary data cleaning, so that our algorithm does not get confused by the Customers column, which is missing from the test set.

train_df$Customers <- NULL 
train_df$Date <- NULL # we are not going to be using the dates column in this iteration
test_df$Date <- NULL

Then we can create a couple of empty vectors that the loop below will fill.

mean_sales <- NULL 
stores <- NULL

And finally we write a for loop that, for each of the 1,115 stores, fits a random forest on that store’s training data, predicts on its test rows, and stores the mean of those predictions.

for (i in 1:1115) {
    store_df_train <- train_df %>%
        filter(Store == i)
    store_df_test <- test_df %>%
        filter(Store == i)
    
    store_df_test$Id <- NULL
    
    # align the StateHoliday factor levels between train and test,
    # otherwise predict() complains about new factor levels
    levels(store_df_test$StateHoliday) <- levels(store_df_train$StateHoliday)
    
    rf <- randomForest(Sales ~ ., data = store_df_train)
    predictions <- predict(rf, store_df_test)
    mean_sales[i] <- mean(predictions)
    stores[i] <- i
}

Now that we have the predictions, we have to construct our submission data frame.

ids <- test_df$Id
ids_stores <- test_df$Store

temp_df <- tibble(stores, mean_sales)   # data_frame() is deprecated in dplyr
names(temp_df) <- c("store", "sales")

temp_df_2 <- data.frame(ids, ids_stores)
names(temp_df_2) <- c("id", "store")

answer <- merge(temp_df_2, temp_df, by = "store")
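To actually submit, we would write this out in the two-column format Kaggle expects; a minimal sketch, assuming the standard Id/Sales submission layout:

# reorder and rename to the expected submission columns, then write out
submission <- answer %>%
    select(Id = id, Sales = sales) %>%
    arrange(Id)
write_csv(submission, "submission.csv")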

This approach achieves a Root Mean Square Percentage Error (RMSPE) of 0.24685, so while it works, there is plenty of room for improvement. For example, there is a lot of additional contextual data that was not taken into account, such as the distance to each store’s nearest competitor.
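As a sketch of one such improvement (assuming the competition’s store.csv file, with per-store metadata such as CompetitionDistance, is available in data/), these extra features could be joined onto the training data before fitting:

# join per-store metadata onto the training set by store id
store_meta <- read_csv("data/store.csv")
train_ext <- train_df %>%
    left_join(store_meta, by = "Store")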