Executive summary

The Friday following Thanksgiving Day in the United States has an infamous name: Black Friday. On this day, stores promote their products at shockingly attractive prices, which pushes customers to reach for their wallets and spend extravagantly. Although it is not an official holiday, since its first appearance in 1952 it has quickly gained popularity, spreading across the United States and, in recent years, globally.

This project attempts to predict how much each customer spent on this special day. Estimating the exact amount is difficult, since it depends on many factors, but even a roughly correct prediction offers insight into the state of the business.

The data

The data was retrieved from Kaggle as the Black Friday dataset, a sample of transaction records from one retail store on that day.

The full data set includes about 550,000 observations and 12 columns; however, only 3,287 complete observations with 10 sensible variables were kept as the main data set. Although this is a massive reduction, the smaller data set speeds up model training, and this is a showcase for demonstration purposes, not for production.
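The preprocessing itself is not shown in this report; below is a minimal sketch of the steps it implies, assuming the column layout of the Kaggle file (User_ID and Product_ID being the two dropped columns) and a final random subsample down to 3,287 rows:

library(dplyr)

blackFriday <- read.csv("BlackFriday.csv") %>%
  select(-User_ID, -Product_ID) %>%        # drop the two ID columns: 12 -> 10 variables
  na.omit() %>%                            # keep only complete observations
  mutate(across(-Purchase, as.factor))     # every predictor becomes a factor
# presumably followed by a random subsample, e.g.:
# blackFriday <- sample_n(blackFriday, 3287)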

The data after being preprocessed:

## Observations: 3,287
## Variables: 10
## $ Gender                     <fct> F, M, M, M, M, M, M, M, M, M, F, M,...
## $ Age                        <fct> 36-45, 36-45, 26-35, 18-25, 26-35, ...
## $ Occupation                 <fct> 1, 17, 17, 2, 7, 13, 16, 1, 4, 13, ...
## $ City_Category              <fct> B, C, A, B, B, B, B, B, B, C, B, B,...
## $ Stay_In_Current_City_Years <fct> 4+, 4+, 2, 3, 1, 1, 0, 4+, 4+, 1, 2...
## $ Marital_Status             <fct> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ Product_Category_1         <fct> 6, 6, 1, 5, 1, 2, 1, 1, 8, 1, 5, 1,...
## $ Product_Category_2         <fct> 10, 8, 2, 8, 2, 3, 8, 2, 14, 2, 11,...
## $ Product_Category_3         <fct> 13, 16, 15, 14, 9, 10, 17, 15, 17, ...
## $ Purchase                   <int> 12642, 12486, 8300, 3613, 19647, 34...

Information about the variables:

Visualization

The dependent variable, Purchase, is the only numeric column in this data set.
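The distribution plot itself is not reproduced in this text version; a minimal sketch of one (the geom and bin count are my own choice):

library(ggplot2)

# Distribution of the dependent variable
ggplot(blackFriday, aes(x = Purchase)) +
  geom_histogram(bins = 30)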

The remaining columns are all factors:
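The original plotting code is also not shown; a sketch of the kind of faceted overview that raises the warning below (gathering factor columns with different levels drops their attributes):

library(dplyr)
library(tidyr)
library(ggplot2)

blackFriday %>%
  select(-Purchase) %>%
  gather(key = "variable", value = "level") %>%   # this step raises the warning
  ggplot(aes(x = level)) +
  geom_bar() +
  facet_wrap(~ variable, scales = "free_x")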

## Warning: attributes are not identical across measure variables;
## they will be dropped

Since the data has already been preprocessed, there are no missing values, which the missingness plot confirms:

library(VIM)  # aggr() from the VIM package visualizes missing-data patterns
aggr(blackFriday,
     numbers = TRUE,
     prop = FALSE,
     sortVars = TRUE,
     cex.axis = .5,
     gap = 2,
     ylab = c("Number of missings", "Pattern"))

## 
##  Variables sorted by number of missings: 
##                    Variable Count
##                      Gender     0
##                         Age     0
##                  Occupation     0
##               City_Category     0
##  Stay_In_Current_City_Years     0
##              Marital_Status     0
##          Product_Category_1     0
##          Product_Category_2     0
##          Product_Category_3     0
##                    Purchase     0

Model building

Data partitioning: split the data 80/20 into training and test sets.

library(caret)  # createDataPartition(), train(), trainControl(), RMSE(), R2()
set.seed(123)
which_train <- createDataPartition(blackFriday$Purchase, 
                                   p = 0.8, 
                                   list = FALSE) 
friday_train <- blackFriday[which_train,]
friday_test <- blackFriday[-which_train,]
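A quick sanity check on the split sizes (these counts can be cross-checked against the degrees of freedom reported in the model summaries below):

nrow(friday_train)  # 2631 observations, about 80%
nrow(friday_test)   # 656 observations, about 20%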

Linear Regression

When it comes to regression analysis, the linear model remains the first and foremost to build. Since there are many predictors here, multiple linear regression is the most suitable choice. With three predictors, for example, the model is expressed by the formula \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\).

linear_train <-  train(Purchase ~ ., data = friday_train, method = "lm")
summary(linear_train)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12373.0  -2027.7    370.8   2300.9   7635.3 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     14105.804   1552.465   9.086  < 2e-16 ***
## GenderM                            19.027    175.664   0.108 0.913752    
## `Age18-25`                         60.510    760.384   0.080 0.936579    
## `Age26-35`                         -1.125    754.915  -0.001 0.998811    
## `Age36-45`                         19.654    762.238   0.026 0.979431    
## `Age46-50`                       -123.032    792.078  -0.155 0.876574    
## `Age51-55`                         82.104    806.601   0.102 0.918931    
## `Age55+`                          -72.634    860.340  -0.084 0.932725    
## Occupation1                       131.756    310.239   0.425 0.671098    
## Occupation2                      -162.304    385.586  -0.421 0.673843    
## Occupation3                       654.059    470.786   1.389 0.164864    
## Occupation4                      -268.258    284.249  -0.944 0.345390    
## Occupation5                       286.821    530.067   0.541 0.588483    
## Occupation6                       495.057    409.215   1.210 0.226478    
## Occupation7                       162.735    287.642   0.566 0.571609    
## Occupation8                      -604.656   1173.220  -0.515 0.606331    
## Occupation9                      -370.920    720.199  -0.515 0.606581    
## Occupation10                       94.281    818.192   0.115 0.908271    
## Occupation11                      377.291    549.853   0.686 0.492669    
## Occupation12                     1048.784    351.639   2.983 0.002885 ** 
## Occupation13                     1554.941    783.309   1.985 0.047241 *  
## Occupation14                       98.677    352.712   0.280 0.779679    
## Occupation15                      391.483    556.757   0.703 0.482027    
## Occupation16                      176.804    393.509   0.449 0.653252    
## Occupation17                      373.177    305.902   1.220 0.222606    
## Occupation18                      108.216    750.671   0.144 0.885386    
## Occupation19                      734.942    620.897   1.184 0.236651    
## Occupation20                     -232.844    339.201  -0.686 0.492493    
## City_CategoryB                    -95.029    175.416  -0.542 0.588049    
## City_CategoryC                    626.541    188.306   3.327 0.000889 ***
## Stay_In_Current_City_Years1        52.973    228.083   0.232 0.816360    
## Stay_In_Current_City_Years2       135.906    252.088   0.539 0.589849    
## Stay_In_Current_City_Years3       -11.846    253.746  -0.047 0.962768    
## `Stay_In_Current_City_Years4+`   -170.979    266.215  -0.642 0.520763    
## Marital_Status1                   -86.712    153.932  -0.563 0.573269    
## Product_Category_12             -2172.708    368.077  -5.903 4.05e-09 ***
## Product_Category_13             -2435.973    553.617  -4.400 1.13e-05 ***
## Product_Category_14            -11475.741    505.880 -22.685  < 2e-16 ***
## Product_Category_15             -7853.838    287.628 -27.306  < 2e-16 ***
## Product_Category_16              1043.881    462.189   2.259 0.023995 *  
## Product_Category_18             -6913.210    426.961 -16.192  < 2e-16 ***
## Product_Category_110             7123.758    821.792   8.669  < 2e-16 ***
## Product_Category_111           -10067.593    713.421 -14.112  < 2e-16 ***
## Product_Category_113           -13526.224    870.490 -15.539  < 2e-16 ***
## Product_Category_115            -1977.339   2834.891  -0.698 0.485553    
## Product_Category_23               332.915    804.088   0.414 0.678888    
## Product_Category_24             -1690.459    510.868  -3.309 0.000949 ***
## Product_Category_25              -231.429    338.782  -0.683 0.494591    
## Product_Category_26               663.391    316.819   2.094 0.036366 *  
## Product_Category_28               476.986    333.865   1.429 0.153218    
## Product_Category_29               -43.155    665.606  -0.065 0.948310    
## Product_Category_210             4129.608    806.150   5.123 3.24e-07 ***
## Product_Category_211             -127.494    397.117  -0.321 0.748199    
## Product_Category_212             -388.661    807.664  -0.481 0.630404    
## Product_Category_213              474.004    511.630   0.926 0.354296    
## Product_Category_214               70.481    429.799   0.164 0.869755    
## Product_Category_215               67.905    364.685   0.186 0.852301    
## Product_Category_216              907.572   1218.885   0.745 0.456587    
## Product_Category_34             -2371.111   1536.313  -1.543 0.122863    
## Product_Category_35               390.476   1364.936   0.286 0.774843    
## Product_Category_36             -1168.571   1412.984  -0.827 0.408301    
## Product_Category_38               476.600   1369.703   0.348 0.727900    
## Product_Category_39              -284.523   1374.885  -0.207 0.836071    
## Product_Category_310             -825.774   1567.709  -0.527 0.598420    
## Product_Category_311            -1732.209   1474.161  -1.175 0.240085    
## Product_Category_312             -349.038   1390.019  -0.251 0.801754    
## Product_Category_313            -2075.332   1421.659  -1.460 0.144469    
## Product_Category_314             -488.529   1371.647  -0.356 0.721748    
## Product_Category_315            -1515.902   1351.144  -1.122 0.261993    
## Product_Category_316             -362.121   1365.719  -0.265 0.790915    
## Product_Category_317              142.728   1385.457   0.103 0.917956    
## Product_Category_318             -999.261   1415.266  -0.706 0.480216    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3526 on 2559 degrees of freedom
## Multiple R-squared:  0.5306, Adjusted R-squared:  0.5175 
## F-statistic: 40.74 on 71 and 2559 DF,  p-value: < 2.2e-16

Predict on the test set and compute RMSE

linear_train_forecasts <- predict(linear_train, friday_test)


# Compute the prediction error, RMSE
linear_RMSE <- RMSE(linear_train_forecasts, friday_test$Purchase)
linear_RMSE
## [1] 3754.62

RMSE, or Root Mean Squared Error, \(\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\), indicates the typical difference between the real and the predicted values; the lower, the better.
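As a check, the same number can be computed directly from the definition:

# RMSE from its definition; equivalent to caret::RMSE() above
sqrt(mean((friday_test$Purchase - linear_train_forecasts)^2))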

# Compute R-squared (caret's R2() defaults to the squared correlation
# between predictions and observations)
linear_Rsq <-  R2(linear_train_forecasts,  friday_test$Purchase)
linear_Rsq
## [1] 0.4745412

Support Vector Machine Regression

Typically, the Support Vector Machine, or SVM, is used for classification rather than regression. However, the same principle carries over to regression (Support Vector Regression, SVR). The downside of SVM is its rather high time complexity, which is why only 2-fold cross-validation is used below.
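For reference, the standard linear \(\varepsilon\)-insensitive formulation of SVR minimizes

\[\frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \max\bigl(0,\; \lvert y_i - (w^\top x_i + b)\rvert - \varepsilon\bigr),\]

so errors smaller than \(\varepsilon\) are ignored and larger ones are penalized linearly; the cost \(C\) trades flatness against training error and is exactly the parameter searched in the grid below.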

# Setting train control: only 2 CV folds, to keep the SVM training time manageable
ctrl_cv2 <- trainControl(method = "cv",
                         number = 2)

# Grid search
parametersC <- data.frame(C = c(0.01, 0.1, 0.2, 0.5, 1, 5))

# Train data
set.seed(123)
svm_Linear <- train(Purchase ~ ., 
                    data = friday_train, 
                    method = "svmLinear",
                    tuneGrid = parametersC,
                    trControl = ctrl_cv2)

# Summary
summary(svm_Linear)
## Length  Class   Mode 
##      1   ksvm     S4
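Note that summary() says little about the underlying kernlab ksvm object. Printing svm_Linear itself would show the resampling results, and the selected cost can be read off directly:

svm_Linear$bestTune  # the value of C chosen by the 2-fold cross-validation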

Predict on the test set and compute RMSE

svm_train_forecasts <- predict(svm_Linear, friday_test)

# Compute the prediction error, RMSE
svm_RMSE <- RMSE(svm_train_forecasts, friday_test$Purchase)
svm_RMSE
## [1] 3893.978

Compute R-squared

svm_Rsq <-  R2(svm_train_forecasts,  friday_test$Purchase)
svm_Rsq
## [1] 0.4565033

Regularization

When the data has a significant number of variables, it can be useful to find a reduced set of them for which performance is optimal.

In such cases, imposing a penalty on the regression pushes the coefficients of the variables that contribute less to the model toward zero. This method is widely called regularization.

There are 3 well-known regularizations (their objectives are written out after this list):

  • Lasso Regression: penalizes the sum of absolute coefficient values (L1 penalty)

  • Ridge Regression: penalizes the sum of squared coefficients (L2 penalty)

  • Elastic net: a combination of Lasso and Ridge
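Using the standard definitions, both estimators minimize the residual sum of squares plus a penalty scaled by \(\lambda\):

\[\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\, \sum_{i=1}^{n}\bigl(y_i - x_i^\top \beta\bigr)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert, \qquad \hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\, \sum_{i=1}^{n}\bigl(y_i - x_i^\top \beta\bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.\]

In glmnet, the alpha parameter selects the penalty (alpha = 1 for Lasso, alpha = 0 for Ridge), and the grids below search \(\lambda\) over 0.01, 0.02, ..., 2.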

This report makes use of the first two, each with 5-fold cross-validation:

ctrl_cv5 <- trainControl(method = "cv",
                         number = 5)

Lasso

parameters_lasso <- expand.grid(alpha = 1, 
                                lambda = seq(0.01, 2, 0.01))
# fitted model
set.seed(123)  # the CV folds are random; a seed keeps the fit reproducible, as in the ridge fit below
friday_lasso <- train(Purchase ~ .,
                      data = friday_train,
                      method = "glmnet", 
                      tuneGrid = parameters_lasso,
                      trControl = ctrl_cv5)

summary(friday_lasso)
##             Length Class      Mode     
## a0            74   -none-     numeric  
## beta        5254   dgCMatrix  S4       
## df            74   -none-     numeric  
## dim            2   -none-     numeric  
## lambda        74   -none-     numeric  
## dev.ratio     74   -none-     numeric  
## nulldev        1   -none-     numeric  
## npasses        1   -none-     numeric  
## jerr           1   -none-     numeric  
## offset         1   -none-     logical  
## call           5   -none-     call     
## nobs           1   -none-     numeric  
## lambdaOpt      1   -none-     numeric  
## xNames        71   -none-     character
## problemType    1   -none-     character
## tuneValue      2   data.frame list     
## obsLevels      1   -none-     logical  
## param          0   -none-     list
plot(friday_lasso)

The cross-validated RMSE is nearly constant in lambda, with a drop toward the higher values.
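The lambda actually selected by the 5-fold cross-validation can be read off directly:

friday_lasso$bestTune  # the (alpha, lambda) pair chosen by CV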

Predict on the test set and compute RMSE

lasso_train_forecasts <- predict(friday_lasso, friday_test)

# Compute the prediction error, RMSE
lasso_RMSE <- RMSE(lasso_train_forecasts, friday_test$Purchase)
lasso_RMSE
## [1] 3752.993

Compute R-squared

lasso_Rsq <-  R2(lasso_train_forecasts,  friday_test$Purchase)
lasso_Rsq
## [1] 0.4747517

Ridge

parameters_ridge <- expand.grid(alpha = 0, # ridge 
                                lambda = seq(0.01, 2, 0.01))
# fitted model
set.seed(123)
friday_ridge <- train(Purchase ~ .,
                     data = friday_train,
                     method = "glmnet", 
                     tuneGrid = parameters_ridge,
                     trControl = ctrl_cv5)

summary(friday_ridge)
##             Length Class      Mode     
## a0           100   -none-     numeric  
## beta        7100   dgCMatrix  S4       
## df           100   -none-     numeric  
## dim            2   -none-     numeric  
## lambda       100   -none-     numeric  
## dev.ratio    100   -none-     numeric  
## nulldev        1   -none-     numeric  
## npasses        1   -none-     numeric  
## jerr           1   -none-     numeric  
## offset         1   -none-     logical  
## call           5   -none-     call     
## nobs           1   -none-     numeric  
## lambdaOpt      1   -none-     numeric  
## xNames        71   -none-     character
## problemType    1   -none-     character
## tuneValue      2   data.frame list     
## obsLevels      1   -none-     logical  
## param          0   -none-     list
plot(friday_ridge) 

The cross-validated RMSE is essentially constant in lambda.

Predict on the test set and compute RMSE

ridge_train_forecasts <- predict(friday_ridge, friday_test)

# Compute the prediction error, RMSE
ridge_RMSE <- RMSE(ridge_train_forecasts, friday_test$Purchase)
ridge_RMSE
## [1] 3745.58

Compute R-squared

ridge_Rsq <- R2(ridge_train_forecasts, friday_test$Purchase)
ridge_Rsq
## [1] 0.4752714

Stepwise methods

Including many predictors often causes problems in regression due to overfitting; including too few, on the other hand, causes underfitting. Instead of specifying by hand which variables should be used, it can be better to let an algorithm do the job. This is the basic idea of stepwise methods.

There are 3 variants of this strategy:

  • Forward selection: the model starts with no predictors, which are then added one by one
  • Backward elimination: the inverse of forward selection, starting with all predictors and removing them one by one
  • Stepwise selection: a combination of forward selection and backward elimination

This report implemented stepwise selection as an example.
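The "lmStepAIC" method used below (caret's wrapper around MASS::stepAIC) scores candidate models by the Akaike Information Criterion,

\[\mathrm{AIC} = 2k - 2\ln(\hat{L}),\]

where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood; at each step, the single addition or deletion of a term that lowers AIC the most is applied, until no move improves it.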

ctrl_nocv <- trainControl(method = 'none')
friday_step <- train(Purchase ~ .,
        data = friday_train, 
        # stepwise method via MASS::stepAIC
        method = "lmStepAIC",
        # additional argument: search in both directions (add and drop terms)
        direction = "both", 
        trControl = ctrl_nocv)
summary(friday_step)
## 
## Call:
## lm(formula = .outcome ~ Occupation4 + Occupation12 + Occupation13 + 
##     Occupation20 + City_CategoryC + Product_Category_12 + Product_Category_13 + 
##     Product_Category_14 + Product_Category_15 + Product_Category_16 + 
##     Product_Category_18 + Product_Category_110 + Product_Category_111 + 
##     Product_Category_113 + Product_Category_24 + Product_Category_26 + 
##     Product_Category_28 + Product_Category_210 + Product_Category_34 + 
##     Product_Category_36 + Product_Category_39 + Product_Category_311 + 
##     Product_Category_312 + Product_Category_313 + Product_Category_314 + 
##     Product_Category_315 + Product_Category_316 + Product_Category_318, 
##     data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11904.4  -2026.6    400.3   2330.9   7415.7 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           14455.2      175.6  82.330  < 2e-16 ***
## Occupation4            -377.0      199.9  -1.886 0.059360 .  
## Occupation12            853.2      301.8   2.827 0.004732 ** 
## Occupation13           1303.6      710.6   1.834 0.066708 .  
## Occupation20           -414.9      288.4  -1.438 0.150434    
## City_CategoryC          695.2      148.3   4.687 2.92e-06 ***
## Product_Category_12   -2260.1      291.9  -7.742 1.39e-14 ***
## Product_Category_13   -2523.6      503.4  -5.013 5.71e-07 ***
## Product_Category_14  -11559.5      437.4 -26.428  < 2e-16 ***
## Product_Category_15   -7853.6      253.7 -30.962  < 2e-16 ***
## Product_Category_16     895.3      414.4   2.161 0.030816 *  
## Product_Category_18   -6816.1      311.6 -21.875  < 2e-16 ***
## Product_Category_110   7278.5      773.0   9.416  < 2e-16 ***
## Product_Category_111  -9851.8      637.8 -15.447  < 2e-16 ***
## Product_Category_113 -13477.9      847.0 -15.912  < 2e-16 ***
## Product_Category_24   -1484.0      440.8  -3.367 0.000772 ***
## Product_Category_26     759.0      273.8   2.773 0.005601 ** 
## Product_Category_28     469.5      231.1   2.032 0.042287 *  
## Product_Category_210   4301.8      758.8   5.669 1.59e-08 ***
## Product_Category_34   -2306.0      598.7  -3.851 0.000120 ***
## Product_Category_36   -1557.1      432.4  -3.601 0.000323 ***
## Product_Category_39    -533.9      324.8  -1.644 0.100305    
## Product_Category_311  -1927.4      631.5  -3.052 0.002295 ** 
## Product_Category_312   -717.8      344.2  -2.085 0.037165 *  
## Product_Category_313  -2286.6      484.7  -4.718 2.51e-06 ***
## Product_Category_314   -757.1      274.1  -2.762 0.005778 ** 
## Product_Category_315  -1757.1      222.9  -7.881 4.73e-15 ***
## Product_Category_316   -533.3      224.3  -2.378 0.017496 *  
## Product_Category_318  -1296.3      433.8  -2.988 0.002831 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3507 on 2602 degrees of freedom
## Multiple R-squared:  0.5277, Adjusted R-squared:  0.5226 
## F-statistic: 103.8 on 28 and 2602 DF,  p-value: < 2.2e-16

Predict on the test set and compute RMSE

step_train_forecasts <- predict(friday_step, friday_test)

# Compute the prediction error, RMSE
step_RMSE <- RMSE(step_train_forecasts, friday_test$Purchase)
step_RMSE
## [1] 3762.547

Compute R-squared

# Compute R-squared
step_Rsq <-  R2(step_train_forecasts,  friday_test$Purchase)
step_Rsq
## [1] 0.472461

Final result and conclusion

result_df <- data.frame(RMSE = c(linear_RMSE, svm_RMSE, ridge_RMSE, lasso_RMSE, step_RMSE),
                        R_squared = c(linear_Rsq, svm_Rsq, ridge_Rsq, lasso_Rsq, step_Rsq))
row.names(result_df) <- c('Linear Regression','Support Vector Machine Regression', 
                         'Ridge Regression', 'Lasso Regression', 'Stepwise Regression')
result_df
##                                       RMSE R_squared
## Linear Regression                 3754.620 0.4745412
## Support Vector Machine Regression 3893.978 0.4565033
## Ridge Regression                  3745.580 0.4752714
## Lasso Regression                  3752.993 0.4747517
## Stepwise Regression               3762.547 0.4724610
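To make the comparison explicit, the best-performing row can be extracted programmatically:

# pick the model with the lowest test-set RMSE
result_df[which.min(result_df$RMSE), , drop = FALSE]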

Among the 5 models, Ridge Regression has the lowest RMSE and the highest R-squared. However, all of the R-squared values show that the models are not good enough; the problem perhaps lies in the limited number, or the limited types, of predictors. Furthermore, the original data has a huge number of observations, so the algorithms might perform better if trained on more of it (which tends to be problematic on typical consumer PC specs).