In the United States, the Friday after Thanksgiving has an infamous name: Black Friday. On this day, stores promote their products at shockingly attractive prices, prompting customers to reach for their wallets and splurge. Though it is not an official holiday, since the first one in 1952 it has quickly gained popularity and spread across the United States and, in recent years, the globe.
This project attempts to predict how much each customer spends on this special day. Estimating that amount is difficult, since it depends on many factors, but even a roughly accurate prediction provides insight into the state of the business.
The data was retrieved from Kaggle as the Black Friday dataset, a sample of transaction records from a single retail store on that day.
The data set includes about 550,000 observations and 12 columns; however, only 3,287 complete observations with 10 sensible variables are kept as the main data set here. Although this is a massive reduction, the smaller data set trains much faster, and this is a showcase for demonstration purposes, not for production.
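The preprocessing code is not shown in this report; the following is only a minimal sketch of how such a subset could be built from the raw Kaggle file (the file name BlackFriday.csv, the dropped ID columns, and the sample size are assumptions for illustration, not taken from the original analysis):
# Hypothetical preprocessing sketch: read the raw file, drop the two ID
# columns (12 -> 10 variables), keep complete rows, convert the remaining
# predictors to factors, and draw a smaller working sample.
library(dplyr)
blackFriday <- read.csv("BlackFriday.csv", stringsAsFactors = FALSE) %>%  # assumed file name
  select(-User_ID, -Product_ID) %>%        # identifiers are not useful predictors
  na.omit() %>%                            # keep only complete observations
  mutate(across(-Purchase, as.factor))     # Purchase stays numeric, the rest become factors
set.seed(123)
blackFriday <- sample_n(blackFriday, 3287) # illustrative subsample of the size used below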
The data after being preprocessed:
## Observations: 3,287
## Variables: 10
## $ Gender <fct> F, M, M, M, M, M, M, M, M, M, F, M,...
## $ Age <fct> 36-45, 36-45, 26-35, 18-25, 26-35, ...
## $ Occupation <fct> 1, 17, 17, 2, 7, 13, 16, 1, 4, 13, ...
## $ City_Category <fct> B, C, A, B, B, B, B, B, B, C, B, B,...
## $ Stay_In_Current_City_Years <fct> 4+, 4+, 2, 3, 1, 1, 0, 4+, 4+, 1, 2...
## $ Marital_Status <fct> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ Product_Category_1 <fct> 6, 6, 1, 5, 1, 2, 1, 1, 8, 1, 5, 1,...
## $ Product_Category_2 <fct> 10, 8, 2, 8, 2, 3, 8, 2, 14, 2, 11,...
## $ Product_Category_3 <fct> 13, 16, 15, 14, 9, 10, 17, 15, 17, ...
## $ Purchase <int> 12642, 12486, 8300, 3613, 19647, 34...
Information about the variables:
Gender: a two-level factor, Female (F) or Male (M)
Age: customers are divided into 7 age groups: "0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"
Occupation: anonymized for privacy; job categories are coded as a factor with levels 0 to 20
City_Category: anonymized for privacy; cities are coded as a factor with levels "A", "B", "C"
Stay_In_Current_City_Years: the number of years the customer has lived in their current city, distinguishing long-term residents from recent arrivals or visitors; recorded as a factor with levels "0", "1", "2", "3", "4+"
Marital_Status: a 0/1 indicator of whether the customer is single or married
Product_Category_1, Product_Category_2, Product_Category_3: masked product categories; their meaning is not specified by the contributor
Purchase: purchase amount (presumably in US dollars)
The dependent variable, Purchase, is the only numeric column in this data set, while all of the remaining columns are factors.
Since the data has already been preprocessed, there are no missing values, as the summary below confirms:
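# missing-data pattern plot: aggr() comes from the VIM package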
aggr(blackFriday,
numbers = TRUE,
prop = FALSE,
sortVars = TRUE,
cex.axis = .5,
gap = 2,
ylab = c("Number of misisngs", "Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Gender 0
## Age 0
## Occupation 0
## City_Category 0
## Stay_In_Current_City_Years 0
## Marital_Status 0
## Product_Category_1 0
## Product_Category_2 0
## Product_Category_3 0
## Purchase 0
Data partitioning: the data is split 80/20 into training and test sets.
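# caret::createDataPartition samples within percentile groups of Purchase,
# so the training and test sets get similar outcome distributions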
set.seed(123)
which_train <- createDataPartition(blackFriday$Purchase,
p = 0.8,
list = FALSE)
friday_train <- blackFriday[which_train,]
friday_test <- blackFriday[-which_train,]
When it comes to regression analysis, the linear model remains the first and foremost model to build. Since there are many predictors here, multiple linear regression is the natural choice. With three predictors, for example, the model is \(y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3\).
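All predictors in this data set are factors, so train() (which fits via lm() for method = "lm") expands each of them into dummy variables. A quick optional check of the resulting design-matrix size, not part of the original analysis:
# each factor is expanded into dummy columns; the column count matches the
# number of coefficients (including the intercept) in the summary below
dim(model.matrix(Purchase ~ ., data = friday_train))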
linear_train <- train(Purchase ~ ., data = friday_train, method = "lm")
summary(linear_train)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12373.0 -2027.7 370.8 2300.9 7635.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14105.804 1552.465 9.086 < 2e-16 ***
## GenderM 19.027 175.664 0.108 0.913752
## `Age18-25` 60.510 760.384 0.080 0.936579
## `Age26-35` -1.125 754.915 -0.001 0.998811
## `Age36-45` 19.654 762.238 0.026 0.979431
## `Age46-50` -123.032 792.078 -0.155 0.876574
## `Age51-55` 82.104 806.601 0.102 0.918931
## `Age55+` -72.634 860.340 -0.084 0.932725
## Occupation1 131.756 310.239 0.425 0.671098
## Occupation2 -162.304 385.586 -0.421 0.673843
## Occupation3 654.059 470.786 1.389 0.164864
## Occupation4 -268.258 284.249 -0.944 0.345390
## Occupation5 286.821 530.067 0.541 0.588483
## Occupation6 495.057 409.215 1.210 0.226478
## Occupation7 162.735 287.642 0.566 0.571609
## Occupation8 -604.656 1173.220 -0.515 0.606331
## Occupation9 -370.920 720.199 -0.515 0.606581
## Occupation10 94.281 818.192 0.115 0.908271
## Occupation11 377.291 549.853 0.686 0.492669
## Occupation12 1048.784 351.639 2.983 0.002885 **
## Occupation13 1554.941 783.309 1.985 0.047241 *
## Occupation14 98.677 352.712 0.280 0.779679
## Occupation15 391.483 556.757 0.703 0.482027
## Occupation16 176.804 393.509 0.449 0.653252
## Occupation17 373.177 305.902 1.220 0.222606
## Occupation18 108.216 750.671 0.144 0.885386
## Occupation19 734.942 620.897 1.184 0.236651
## Occupation20 -232.844 339.201 -0.686 0.492493
## City_CategoryB -95.029 175.416 -0.542 0.588049
## City_CategoryC 626.541 188.306 3.327 0.000889 ***
## Stay_In_Current_City_Years1 52.973 228.083 0.232 0.816360
## Stay_In_Current_City_Years2 135.906 252.088 0.539 0.589849
## Stay_In_Current_City_Years3 -11.846 253.746 -0.047 0.962768
## `Stay_In_Current_City_Years4+` -170.979 266.215 -0.642 0.520763
## Marital_Status1 -86.712 153.932 -0.563 0.573269
## Product_Category_12 -2172.708 368.077 -5.903 4.05e-09 ***
## Product_Category_13 -2435.973 553.617 -4.400 1.13e-05 ***
## Product_Category_14 -11475.741 505.880 -22.685 < 2e-16 ***
## Product_Category_15 -7853.838 287.628 -27.306 < 2e-16 ***
## Product_Category_16 1043.881 462.189 2.259 0.023995 *
## Product_Category_18 -6913.210 426.961 -16.192 < 2e-16 ***
## Product_Category_110 7123.758 821.792 8.669 < 2e-16 ***
## Product_Category_111 -10067.593 713.421 -14.112 < 2e-16 ***
## Product_Category_113 -13526.224 870.490 -15.539 < 2e-16 ***
## Product_Category_115 -1977.339 2834.891 -0.698 0.485553
## Product_Category_23 332.915 804.088 0.414 0.678888
## Product_Category_24 -1690.459 510.868 -3.309 0.000949 ***
## Product_Category_25 -231.429 338.782 -0.683 0.494591
## Product_Category_26 663.391 316.819 2.094 0.036366 *
## Product_Category_28 476.986 333.865 1.429 0.153218
## Product_Category_29 -43.155 665.606 -0.065 0.948310
## Product_Category_210 4129.608 806.150 5.123 3.24e-07 ***
## Product_Category_211 -127.494 397.117 -0.321 0.748199
## Product_Category_212 -388.661 807.664 -0.481 0.630404
## Product_Category_213 474.004 511.630 0.926 0.354296
## Product_Category_214 70.481 429.799 0.164 0.869755
## Product_Category_215 67.905 364.685 0.186 0.852301
## Product_Category_216 907.572 1218.885 0.745 0.456587
## Product_Category_34 -2371.111 1536.313 -1.543 0.122863
## Product_Category_35 390.476 1364.936 0.286 0.774843
## Product_Category_36 -1168.571 1412.984 -0.827 0.408301
## Product_Category_38 476.600 1369.703 0.348 0.727900
## Product_Category_39 -284.523 1374.885 -0.207 0.836071
## Product_Category_310 -825.774 1567.709 -0.527 0.598420
## Product_Category_311 -1732.209 1474.161 -1.175 0.240085
## Product_Category_312 -349.038 1390.019 -0.251 0.801754
## Product_Category_313 -2075.332 1421.659 -1.460 0.144469
## Product_Category_314 -488.529 1371.647 -0.356 0.721748
## Product_Category_315 -1515.902 1351.144 -1.122 0.261993
## Product_Category_316 -362.121 1365.719 -0.265 0.790915
## Product_Category_317 142.728 1385.457 0.103 0.917956
## Product_Category_318 -999.261 1415.266 -0.706 0.480216
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3526 on 2559 degrees of freedom
## Multiple R-squared: 0.5306, Adjusted R-squared: 0.5175
## F-statistic: 40.74 on 71 and 2559 DF, p-value: < 2.2e-16
Predict on the test set and compute the RMSE
linear_train_forecasts <- predict(linear_train, friday_test)
# Compute the prediction error, RMSE
linear_RMSE <- RMSE(linear_train_forecasts, friday_test$Purchase)
linear_RMSE
## [1] 3754.62
RMSE, also known as Root Mean Squared Error, measures the typical difference between the actual and predicted values, so the lower the better.
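As a minimal sanity check, the same number can be computed by hand, since caret's RMSE() is simply the square root of the mean squared prediction error:
# RMSE computed manually; should match the caret::RMSE value above
manual_RMSE <- sqrt(mean((linear_train_forecasts - friday_test$Purchase)^2))
all.equal(manual_RMSE, linear_RMSE)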
# Compute R-squared
linear_Rsq <- R2(linear_train_forecasts, friday_test$Purchase)
linear_Rsq
## [1] 0.4745412
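By default caret's R2() reports the squared correlation between predictions and observations, so the value above can also be reproduced directly:
# R-squared as the squared Pearson correlation between forecasts and actual purchases
cor(linear_train_forecasts, friday_test$Purchase)^2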
Typically, the Support Vector Machine (SVM) is used for classification rather than regression, but the same mechanism carries over to regression problems as well (support vector regression). The downside of SVM is its rather high time complexity.
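For intuition, linear support vector regression minimizes a margin penalty plus an \(\varepsilon\)-insensitive loss, where the cost parameter \(C\) (the one tuned in the grid below) controls how heavily errors larger than \(\varepsilon\) are punished:
\[
\min_{w,\,b}\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \max\bigl(0,\ \lvert y_i - (w^\top x_i + b)\rvert - \varepsilon\bigr)
\]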
# Setting train control
ctrl_cv2 <- trainControl(method = "cv",
number = 2)
# Grid search
parametersC <- data.frame(C = c(0.01, 0.1, 0.2, 0.5, 1, 5))
# Train data
set.seed(123)
svm_Linear <- train(Purchase ~ .,
data = friday_train,
method = "svmLinear",
tuneGrid = parametersC,
trControl = ctrl_cv2)
# Summary
summary(svm_Linear)
## Length Class Mode
## 1 ksvm S4
Predict on the test set and compute the RMSE
svm_train_forecasts <- predict(svm_Linear, friday_test)
# Compute the prediction error, RMSE
svm_RMSE <- RMSE(svm_train_forecasts, friday_test$Purchase)
svm_RMSE
## [1] 3893.978
Compute R-squared
svm_Rsq <- R2(svm_train_forecasts, friday_test$Purchase)
svm_Rsq
## [1] 0.4565033
When the data contain a large number of variables, it can be useful to find a reduced set of variables so that performance is optimal.
In such cases, imposing a penalty on the regression pushes the coefficients of variables that contribute little to the model toward zero. This technique is widely known as regularization.
There are three well-known regularization approaches (their penalized objectives are written out after this list):
Lasso regression: penalizes the sum of absolute coefficient values (the L1 penalty)
Ridge regression: penalizes the sum of squared coefficients (the L2 penalty)
Elastic net: a combination of the lasso and ridge penalties
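In standard notation, the two penalized least-squares objectives are
\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\bigl(y_i - x_i^\top \beta\bigr)^2 + \lambda \sum_{j} \lvert \beta_j \rvert,
\qquad
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\bigl(y_i - x_i^\top \beta\bigr)^2 + \lambda \sum_{j} \beta_j^2 .
\]
In glmnet (used through caret below), the alpha parameter selects the penalty: alpha = 1 gives the lasso and alpha = 0 gives ridge, which is exactly how the two tuning grids are defined.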
This report uses the first two, each with 5-fold cross-validation.
ctrl_cv5 <- trainControl(method = "cv",
number = 5)
parameters_lasso <- expand.grid(alpha = 1,
lambda = seq(0.01, 2, 0.01))
# fitted model
friday_lasso <- train(Purchase ~ .,
data = friday_train,
method = "glmnet",
tuneGrid = parameters_lasso,
trControl = ctrl_cv5)
summary(friday_lasso)
## Length Class Mode
## a0 74 -none- numeric
## beta 5254 dgCMatrix S4
## df 74 -none- numeric
## dim 2 -none- numeric
## lambda 74 -none- numeric
## dev.ratio 74 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
## lambdaOpt 1 -none- numeric
## xNames 71 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
plot(friday_lasso)
The cross-validated RMSE is roughly constant across the lambda grid, with a drop toward the larger values.
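The lambda value actually selected by cross-validation is stored by caret in the fitted object:
# penalty value chosen by 5-fold CV for the lasso model
friday_lasso$bestTune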
Predict on the test set and compute the RMSE
lasso_train_forecasts <- predict(friday_lasso, friday_test)
# Compute the prediction error, RMSE
lasso_RMSE <- RMSE(lasso_train_forecasts, friday_test$Purchase)
lasso_RMSE
## [1] 3752.993
Compute R-squared
lasso_Rsq <- R2(lasso_train_forecasts, friday_test$Purchase)
lasso_Rsq
## [1] 0.4747517
parameters_ridge <- expand.grid(alpha = 0, # ridge
lambda = seq(0.01, 2, 0.01))
# fitted model
set.seed(123)
friday_ridge <- train(Purchase ~ .,
data = friday_train,
method = "glmnet",
tuneGrid = parameters_ridge,
trControl = ctrl_cv5)
summary(friday_ridge)
## Length Class Mode
## a0 100 -none- numeric
## beta 7100 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
## lambdaOpt 1 -none- numeric
## xNames 71 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
plot(friday_ridge)
The cross-validated RMSE is essentially constant across the lambda grid.
Predict on the test set and compute the RMSE
ridge_train_forecasts <- predict(friday_ridge, friday_test)
# Compute the prediction error, RMSE
ridge_RMSE <- RMSE(ridge_train_forecasts, friday_test$Purchase)
ridge_RMSE
## [1] 3745.58
Compute R-squared
ridge_Rsq <- R2(ridge_train_forecasts, friday_test$Purchase)
ridge_Rsq
## [1] 0.4752714
Including many predictors often causes problems in regression due to overfitting; on the other hand, including too few predictors causes underfitting. Instead of specifying by hand which variables should be used, it is often better to let an algorithm do the job. This is the basic idea behind stepwise methods.
There are three schools of this strategy: forward selection, backward elimination, and bidirectional (stepwise) selection.
This report implements bidirectional stepwise selection as an example.
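Under the hood, the "lmStepAIC" method wraps MASS::stepAIC, which adds or drops one term at a time so as to minimize Akaike's information criterion, \(\mathrm{AIC} = 2k - 2\ln(\hat{L})\), where \(k\) is the number of estimated parameters and \(\hat{L}\) the maximized likelihood.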
ctrl_nocv <- trainControl(method = 'none')
friday_step <- train(Purchase ~ .,
data = friday_train,
# stepwise method
method = "lmStepAIC",
# additional argument
direction = "both",
trControl = ctrl_nocv)
summary(friday_step)
##
## Call:
## lm(formula = .outcome ~ Occupation4 + Occupation12 + Occupation13 +
## Occupation20 + City_CategoryC + Product_Category_12 + Product_Category_13 +
## Product_Category_14 + Product_Category_15 + Product_Category_16 +
## Product_Category_18 + Product_Category_110 + Product_Category_111 +
## Product_Category_113 + Product_Category_24 + Product_Category_26 +
## Product_Category_28 + Product_Category_210 + Product_Category_34 +
## Product_Category_36 + Product_Category_39 + Product_Category_311 +
## Product_Category_312 + Product_Category_313 + Product_Category_314 +
## Product_Category_315 + Product_Category_316 + Product_Category_318,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11904.4 -2026.6 400.3 2330.9 7415.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14455.2 175.6 82.330 < 2e-16 ***
## Occupation4 -377.0 199.9 -1.886 0.059360 .
## Occupation12 853.2 301.8 2.827 0.004732 **
## Occupation13 1303.6 710.6 1.834 0.066708 .
## Occupation20 -414.9 288.4 -1.438 0.150434
## City_CategoryC 695.2 148.3 4.687 2.92e-06 ***
## Product_Category_12 -2260.1 291.9 -7.742 1.39e-14 ***
## Product_Category_13 -2523.6 503.4 -5.013 5.71e-07 ***
## Product_Category_14 -11559.5 437.4 -26.428 < 2e-16 ***
## Product_Category_15 -7853.6 253.7 -30.962 < 2e-16 ***
## Product_Category_16 895.3 414.4 2.161 0.030816 *
## Product_Category_18 -6816.1 311.6 -21.875 < 2e-16 ***
## Product_Category_110 7278.5 773.0 9.416 < 2e-16 ***
## Product_Category_111 -9851.8 637.8 -15.447 < 2e-16 ***
## Product_Category_113 -13477.9 847.0 -15.912 < 2e-16 ***
## Product_Category_24 -1484.0 440.8 -3.367 0.000772 ***
## Product_Category_26 759.0 273.8 2.773 0.005601 **
## Product_Category_28 469.5 231.1 2.032 0.042287 *
## Product_Category_210 4301.8 758.8 5.669 1.59e-08 ***
## Product_Category_34 -2306.0 598.7 -3.851 0.000120 ***
## Product_Category_36 -1557.1 432.4 -3.601 0.000323 ***
## Product_Category_39 -533.9 324.8 -1.644 0.100305
## Product_Category_311 -1927.4 631.5 -3.052 0.002295 **
## Product_Category_312 -717.8 344.2 -2.085 0.037165 *
## Product_Category_313 -2286.6 484.7 -4.718 2.51e-06 ***
## Product_Category_314 -757.1 274.1 -2.762 0.005778 **
## Product_Category_315 -1757.1 222.9 -7.881 4.73e-15 ***
## Product_Category_316 -533.3 224.3 -2.378 0.017496 *
## Product_Category_318 -1296.3 433.8 -2.988 0.002831 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3507 on 2602 degrees of freedom
## Multiple R-squared: 0.5277, Adjusted R-squared: 0.5226
## F-statistic: 103.8 on 28 and 2602 DF, p-value: < 2.2e-16
Predict on the test set and compute the RMSE
step_train_forecasts <- predict(friday_step, friday_test)
# Compute the prediction error, RMSE
step_RMSE <- RMSE(step_train_forecasts, friday_test$Purchase)
step_RMSE
## [1] 3762.547
Compute R-squared
step_Rsq <- R2(step_train_forecasts, friday_test$Purchase)
step_Rsq
## [1] 0.472461
result_df <- data.frame(RMSE = c(linear_RMSE, svm_RMSE, ridge_RMSE, lasso_RMSE, step_RMSE),
R_squared = c(linear_Rsq, svm_Rsq, ridge_Rsq, lasso_Rsq, step_Rsq))
row.names(result_df)<- c('Linear Regression','Support Vector Machine Regression',
'Ridge Regression', 'Lasso Regression', 'Stepwise Regression')
result_df
## RMSE R_squared
## Linear Regression 3754.620 0.4745412
## Support Vector Machine Regression 3893.978 0.4565033
## Ridge Regression 3745.580 0.4752714
## Lasso Regression 3752.993 0.4747517
## Stepwise Regression 3762.547 0.4724610
Of the five models, ridge regression has the lowest RMSE and the highest R-squared. However, all of the R-squared values above show that the models are not good enough; the problem may lie in the limited number, or the limited types, of predictors. Furthermore, the original data contain a huge number of observations, so the algorithms might perform better if trained on more of them (which tends to be problematic on ordinary consumer PC hardware).