The purpose of this study is to show you how to predict future sales for a business using linear regression.

The dataset SalesData contains information on customers for the months one to three. Only the sales of month four are included.

LOOK AT THE DATA

#Import data
sales <- read.csv("salesData.csv")

#Structure of Dataset
glimpse(sales)
## Observations: 5,122
## Variables: 14
## $ id                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1...
## $ nItems                <int> 1469, 1463, 262, 293, 108, 216, 174, 122...
## $ mostFreqStore         <fct> Stockton, Stockton, Colorado Springs, Co...
## $ mostFreqCat           <fct> Alcohol, Alcohol, Shoes, Bakery, Beverag...
## $ nCats                 <int> 72, 73, 55, 50, 32, 41, 36, 31, 41, 52, ...
## $ preferredBrand        <fct> Veina, Veina, Bo, Veina, Bo, Bo, Bo, Bo,...
## $ nBrands               <int> 517, 482, 126, 108, 79, 98, 78, 62, 99, ...
## $ nPurch                <int> 82, 88, 56, 43, 18, 35, 34, 12, 26, 33, ...
## $ salesLast3Mon         <dbl> 2741.97, 2790.58, 1529.55, 1765.81, 1180...
## $ salesThisMon          <dbl> 1283.87, 1242.60, 682.57, 730.23, 552.54...
## $ daysSinceLastPurch    <int> 1, 1, 1, 1, 12, 2, 2, 4, 14, 1, 7, 3, 6,...
## $ meanItemPrice         <dbl> 1.866555, 1.907437, 5.837977, 6.026655, ...
## $ meanShoppingCartValue <dbl> 33.43866, 31.71114, 27.31339, 41.06535, ...
## $ customerDuration      <int> 821, 657, 548, 596, 603, 673, 612, 517, ...

Now let’s visualize the correlation of the continuous explanatory variables for the past three months with the sales variable of this month.

#Remove id variable from sales
salesData <- sales[ ,-1]

#Correlation
salesData %>% 
  select_if(is.numeric) %>%
  cor() %>%
  corrplot()

According to this plot, we see which variables are probably well suited to explain the sales of this month (salesThisMon). Particulary, we see that the sales in the last three months (salesLast3Mon) are strongly positively correlated with the sales in this month(salesThisMon). Hence we will start off including that as an explanatory variable in a linear regression.

ESTIMATING SIMPLE LINEAR REGRESSION

#Simple linear regression model
salesSimpleModel <- lm(salesThisMon ~ salesLast3Mon, data = salesData) 

summary(salesSimpleModel)
## 
## Call:
## lm(formula = salesThisMon ~ salesLast3Mon, data = salesData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -570.18  -68.26    3.21   72.98  605.58 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   99.690501   6.083886   16.39   <2e-16 ***
## salesLast3Mon  0.382696   0.004429   86.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 117.5 on 5120 degrees of freedom
## Multiple R-squared:  0.5932, Adjusted R-squared:  0.5931 
## F-statistic:  7465 on 1 and 5120 DF,  p-value: < 2.2e-16

There is a positive relationship (slope = 0.38) between the two variables. R-squared equals to 0.59 so we can say that almost 60 percent of the variation in the sales of this month can be explained by the sales in the last three months.

Now, let’s calculate a full model using all variables (without id) in order to explain the sales in this month.

MULTIPLE LINEAR REGRESSION

#Multiple linear regression
sales_full_Mlr <- lm(salesThisMon ~ ., data = salesData) 

summary(sales_full_Mlr)
## 
## Call:
## lm(formula = salesThisMon ~ ., data = salesData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -322.76  -50.76    0.78   50.90  398.79 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -2.585e+02  1.762e+01 -14.673  < 2e-16 ***
## nItems                         1.605e-01  2.709e-02   5.923 3.37e-09 ***
## mostFreqStoreColorado Springs -7.167e+00  4.350e+00  -1.648 0.099503 .  
## mostFreqStoreColumbus          9.579e-01  3.680e+00   0.260 0.794642    
## mostFreqStoreDenver           -8.601e+00  5.130e+00  -1.676 0.093722 .  
## mostFreqStoreHonolulu         -1.588e+01  4.916e+00  -3.231 0.001242 ** 
## mostFreqStoreJersey           -2.169e+01  5.031e+00  -4.311 1.66e-05 ***
## mostFreqStoreOrlando          -1.052e+01  4.492e+00  -2.342 0.019210 *  
## mostFreqStoreSan Diego        -2.009e+01  5.717e+00  -3.514 0.000446 ***
## mostFreqStoreSeattle          -9.784e+00  3.539e+00  -2.765 0.005716 ** 
## mostFreqStoreStockton         -1.176e+02  3.580e+01  -3.286 0.001022 ** 
## mostFreqCatBaby               -3.413e+00  3.513e+00  -0.972 0.331249    
## mostFreqCatBakery             -1.025e+01  5.456e+00  -1.879 0.060339 .  
## mostFreqCatBeverages           3.351e-01  7.008e+00   0.048 0.961867    
## mostFreqCatClothes            -8.527e+00  6.213e+00  -1.372 0.170010    
## mostFreqCatFresh food         -6.372e+00  7.245e+00  -0.880 0.379164    
## mostFreqCatFrozen food        -8.084e+00  3.840e+00  -2.105 0.035332 *  
## mostFreqCatPackaged food      -8.346e-01  4.356e+00  -0.192 0.848063    
## mostFreqCatPets                8.508e+00  7.242e+00   1.175 0.240102    
## mostFreqCatShoes               3.298e+00  3.286e+00   1.004 0.315452    
## nCats                         -7.917e-01  2.345e-01  -3.375 0.000742 ***
## preferredBrandAlekto          -5.590e+00  1.649e+01  -0.339 0.734645    
## preferredBrandBo              -2.505e+01  1.438e+01  -1.742 0.081516 .  
## preferredBrandKatram          -6.264e+01  2.334e+01  -2.684 0.007295 ** 
## preferredBrandKellest         -5.349e+01  2.214e+01  -2.416 0.015713 *  
## preferredBrandMedeia          -2.161e+01  1.556e+01  -1.389 0.164967    
## preferredBrandMoone           -4.166e+01  1.627e+01  -2.561 0.010453 *  
## preferredBrandNilima          -2.888e+01  1.454e+01  -1.986 0.047040 *  
## preferredBrandTanvi            3.135e+01  2.129e+01   1.472 0.141076    
## preferredBrandVeina           -1.861e+01  1.451e+01  -1.282 0.199837    
## nBrands                       -4.804e-02  8.468e-02  -0.567 0.570533    
## nPurch                         4.758e-01  1.513e-01   3.145 0.001669 ** 
## salesLast3Mon                  3.753e-01  8.599e-03  43.652  < 2e-16 ***
## daysSinceLastPurch             1.794e-01  1.524e-01   1.177 0.239322    
## meanItemPrice                  1.793e-01  9.289e-02   1.930 0.053680 .  
## meanShoppingCartValue          2.596e-01  2.618e-02   9.918  < 2e-16 ***
## customerDuration               5.713e-01  7.148e-03  79.927  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.34 on 5085 degrees of freedom
## Multiple R-squared:  0.8249, Adjusted R-squared:  0.8237 
## F-statistic: 665.6 on 36 and 5085 DF,  p-value: < 2.2e-16

The full model gives an R² equals to 0.82 which is better than the R² of the simple linear regression model. But, we can see that some coefficients of the full model are not significant (p-value > 0.05). We have to remove the variables concerned from the model.

# New Multiple linear regression model
sales_Mlr_new <- lm(salesThisMon ~ . -mostFreqStore -mostFreqCat -preferredBrand -nBrands -daysSinceLastPurch -meanItemPrice, data = salesData) 

summary(sales_Mlr_new)
## 
## Call:
## lm(formula = salesThisMon ~ . - mostFreqStore - mostFreqCat - 
##     preferredBrand - nBrands - daysSinceLastPurch - meanItemPrice, 
##     data = salesData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -314.56  -52.04    0.60   51.78  417.07 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.779e+02  8.644e+00 -32.149  < 2e-16 ***
## nItems                 1.557e-01  1.963e-02   7.933 2.62e-15 ***
## nCats                 -1.034e+00  1.841e-01  -5.617 2.05e-08 ***
## nPurch                 5.983e-01  1.483e-01   4.034 5.57e-05 ***
## salesLast3Mon          3.687e-01  7.764e-03  47.489  < 2e-16 ***
## meanShoppingCartValue  2.839e-01  2.381e-02  11.926  < 2e-16 ***
## customerDuration       5.694e-01  7.191e-03  79.174  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.96 on 5115 degrees of freedom
## Multiple R-squared:  0.8211, Adjusted R-squared:  0.8208 
## F-statistic:  3911 on 6 and 5115 DF,  p-value: < 2.2e-16

After removing some variables, we see that all variables are significant and the R² of the new model is 0.82 which is close to the value of R² of the full model. We keep the model with the fewest variables.

Instead of doing removing non signifificant vaiables manually, we can do it automatically. When building a model you have to figure out which variables to include. One usefull tool is the function stepAIC() of the MASS package. A full model is iteratively compared to several other models such that variables are dropped and added based on their significance. The process goes on as long as the AIC value decreases and stops when a minimum is reached. At the end of this process, you’ll see a model with fewer explanatory variables and a superior AIC value.

#Model specification
sales_Mlr_aic <- stepAIC(sales_full_Mlr, trace = 0, steps = 5000)

summary(sales_Mlr_aic)
## 
## Call:
## lm(formula = salesThisMon ~ nItems + mostFreqStore + nCats + 
##     preferredBrand + nPurch + salesLast3Mon + meanItemPrice + 
##     meanShoppingCartValue + customerDuration, data = salesData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -321.31  -50.29    0.63   51.07  411.71 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -2.551e+02  1.741e+01 -14.656  < 2e-16 ***
## nItems                         1.576e-01  2.059e-02   7.656 2.28e-14 ***
## mostFreqStoreColorado Springs -7.204e+00  4.338e+00  -1.660 0.096887 .  
## mostFreqStoreColumbus          1.229e+00  3.674e+00   0.334 0.738064    
## mostFreqStoreDenver           -8.709e+00  5.126e+00  -1.699 0.089413 .  
## mostFreqStoreHonolulu         -1.578e+01  4.901e+00  -3.220 0.001289 ** 
## mostFreqStoreJersey           -2.169e+01  5.029e+00  -4.314 1.63e-05 ***
## mostFreqStoreOrlando          -1.039e+01  4.486e+00  -2.316 0.020599 *  
## mostFreqStoreSan Diego        -1.968e+01  5.713e+00  -3.445 0.000576 ***
## mostFreqStoreSeattle          -9.963e+00  3.533e+00  -2.820 0.004819 ** 
## mostFreqStoreStockton         -1.199e+02  3.562e+01  -3.367 0.000766 ***
## nCats                         -7.835e-01  2.012e-01  -3.894 0.000100 ***
## preferredBrandAlekto          -6.216e+00  1.647e+01  -0.377 0.705931    
## preferredBrandBo              -2.569e+01  1.436e+01  -1.789 0.073627 .  
## preferredBrandKatram          -6.271e+01  2.313e+01  -2.712 0.006717 ** 
## preferredBrandKellest         -5.266e+01  2.213e+01  -2.379 0.017385 *  
## preferredBrandMedeia          -2.337e+01  1.556e+01  -1.502 0.133038    
## preferredBrandMoone           -4.322e+01  1.619e+01  -2.669 0.007626 ** 
## preferredBrandNilima          -2.823e+01  1.452e+01  -1.944 0.051987 .  
## preferredBrandTanvi            2.707e+01  2.106e+01   1.285 0.198772    
## preferredBrandVeina           -2.013e+01  1.447e+01  -1.391 0.164316    
## nPurch                         4.646e-01  1.495e-01   3.109 0.001888 ** 
## salesLast3Mon                  3.697e-01  8.257e-03  44.773  < 2e-16 ***
## meanItemPrice                  1.864e-01  8.960e-02   2.080 0.037533 *  
## meanShoppingCartValue          2.687e-01  2.461e-02  10.916  < 2e-16 ***
## customerDuration               5.708e-01  7.148e-03  79.856  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.39 on 5096 degrees of freedom
## Multiple R-squared:  0.8243, Adjusted R-squared:  0.8235 
## F-statistic: 956.6 on 25 and 5096 DF,  p-value: < 2.2e-16
#Formula of the "true" model 
as.formula(summary(sales_Mlr_aic)$call)
## salesThisMon ~ nItems + mostFreqStore + nCats + preferredBrand + 
##     nPurch + salesLast3Mon + meanItemPrice + meanShoppingCartValue + 
##     customerDuration

Unfortunately, there are still non-significant variables in the model sales_Mlr_aic so for the rest we will keep the model sales_Mlr_new. Nevertheless the function stepAIC() is very useful for the selection of models.

INTERPRETATION OF COEFFICIENTS

Let’s review the coefficients of sales_Mlr_new model.

# Model "sales_Mlr_new" in a tidy version
tidy(sales_Mlr_new)
## # A tibble: 7 x 5
##   term                  estimate std.error statistic   p.value
##   <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)           -278.      8.64       -32.1  1.05e-206
## 2 nItems                   0.156   0.0196       7.93 2.62e- 15
## 3 nCats                   -1.03    0.184       -5.62 2.05e-  8
## 4 nPurch                   0.598   0.148        4.03 5.57e-  5
## 5 salesLast3Mon            0.369   0.00776     47.5  0.       
## 6 meanShoppingCartValue    0.284   0.0238      11.9  2.33e- 32
## 7 customerDuration         0.569   0.00719     79.2  0.

For example, we can say that the effect of the customer duration on the sales this month is statistically significant. A one-unit increase in the customer duration leads to a 0.57 Euro increase in the sales of this month. (The longer a customer stays in the business, the more sales he should generate).

PREDICTION OF FUTURE SALES

The new dataset called salesData2_4 contains information on the customers for the months two to four. We want to use this information in order to predict the sales for month 5.

#Import dataset
salesData2_4 <- read.csv("salesDataMon2To4.csv")

#predict the sales for month 5
predSales5 <- predict(sales_Mlr_new, newdata = salesData2_4)

#Mean of the expected sales
mean(predSales5, na.rm = FALSE)
## [1] 625.0973