1 Executive Summary

ABC Private Limited is a retail company that would like to understand the purchasing habits of their customers so that they can offer a personalized list of products that would interest those customers. This report will explore a month’s worth of sales data from ABC Private Limited. The variable of interest in this report is the amount purchased in the last month. This data also contains demographic information on customers that shopped at ABC Private Limited in the previous month. This demographic information includes age, gender, occupation, city category, stay in current city and marital status. It also contains information on the products purchased, such as their ID and different product category information. Using this information, a machine learning model will be built that will predict the purchase amount based on the customer’s demographics and the categories of the products. In the end, ABC Private Unlimited will have information that they can use to offer customers products that will be appropriate for them.

2 Exploratory Analysis

The first step in exploring the data is loading the libraries that will be used for analysis. Next the data was stored in the variable black_friday. Inspecting the data reveals that every variable other than Purchase is a categorical variable. However, there are some columns that are stored as numerical data. The columns that needed changing were Marital_Status, Occupation, User_ID, Product_Category_1, Product_Category_2 and Product_Category_3. So, these columns were changed to factor variables.

Another problem with the data is that Product_Category_2 and Product_Category_3 have missing values. Since these columns contain categorical data, it would not be appropriate to use median or kNN imputation. However, a value is necessary where one is missing for the machine learning phase of this report. Therefore the value 0 is imputed for the missing data.

Next, the fully clean dataset is examined. It reveals that the median purchase amount for last month was $8,047. In addition, it shows that the mean was $9,264. The minimum purchase amount was $12 and the maximum purchase amount was $23,961. A histogram of the purchase variable shows a unimodal curve that has a positive skew. This positive skew explains why the mean of the purchase amount variable is larger than the median of purchase amount variable.

Further exploration reveals that the when viewing histograms of the purchase amount for the previous month broken down using other variables, the median does not move much. There are times when the histogram will have a larger count. However, in the case of gender, that is to be expected because there were more male customers. There were 135,809 female customers, but there were 414,259 male customers. The same is true when looking at purchase amount broken down by age. The 26-35 year old group made the most purchases, which explains why the count was greater in the Purchases Histogram by Age graph. Interestingly, all histograms reveal a similar median and a similar positive skew. However, people who shop most at ABC Private Limited are males aged 18-45. Additionally they are usually of the marital status 0. Unfortunately, only those at ABC Private Limited know what that means. Additionally, shoppers at ABC Private Limited usually are new to their city and have the City Category B.

3 Machine Learning Models

This report uses machine learning in order to build a prediction model that will predict a customer’s purchase amount. The first step to building this model is removing the User_ID and Product_ID columns. These columns were removed because they are variables that have zero variance. Each Product_ID and each User_ID will be particular to the customer or product.

After these near zero variance variables were removed, a sample was selected from the data. This sample was selected because this dataset is large. Its size may cause trouble when building a machine learning model. Fortunately, a large sample of the data should be enough to represent the data accurately and to build a machine learning model. The sample that was selected was selected randomly. The sample size 10% of the data. Once the sample is selected, the next step in building the model can be performed.

Next the sampled data was partitioned. 70% of the sample was selected randomly to be the train set. The train set is the data that will be used to build the algorithm. The other 30% of the sample was assigned to the test set. The test set will be used to test the accuracy of the model.

Once the sample is partitioned, a list of machine learning algorithms using different methods is compiled. These algorithms were built using ten-fold repeated cross-validation with three repeats. The models created include a glm model, a glmnet model, a linear regression model, a GBM model and a treebag model. The model that produced the best median RMSE was the gbm model. This model produced a median RMSE of 3002.142. Using this model would produce fairly accurate predictions. However, creating an ensemble model using these models should produce even greater accuracy.

Next, three ensemble models were created. These ensemble models included a linear regression model, a glmnet model and a gbm model. Each model also used ten-fold cross-validation repeated three times. All three of these models proved to be better predictors than any of the other models alone. The gbm model produced an RMSE of 2980.54. The linear model produced an RMSE of 2993.22 and the glmnet model produced an RMSE of 2993.751. Each of these models will be tested on the testing data set to see how well each model predicts the purchase variable.

4 Conclusion

There are a few conclusions that can be made using the analysis in this paper. The first conclusion is that even when broken down into different demographics, the median purchase made by customers does not fluctuate much. It didn’t matter if the group was male, female, young, old, married or unmarried, the median purchase by the customers hovered around $8000. However, some groups were more present than others. Males shopped more than females. The marital status 0 shopped more than the marital status 1. Unfortunately, which label mean married and which label means unmarried is unknown. Also customers between the ages of 18 and 45 shopped the most. The age range 26-35 had the highest turnout. Additionally, people who only lived in their city for a year shopped a lot.

So there are two different ways that ABC Private Limited could increase their sales. One way would be for them to advertise to groups that do not shop as often as the other groups. The thinking here is that no matter what, median purchase will not change much. However, these unrepresented groups could be represented more and increase purchases. The other option would be for ABC Private Limited to target the demographics that they know show up most often. The thinking here is that these customers produce the largest part of the customer base. Therefore they would be the people most likely to shop in the store.

Finally, the models were tested to find the model that makes the best predictions. When analyzing the data from the testing model, it is revealed that the Product_Category_1, Product_Category_2 and Product_Category_3 variables have new levels. These new variables will present a problem when making predictions. Therefore, the original models needed to be revisited and these variables were left as numeric variables after 0 was imputed for the missing values. Once the different models were used to make predictions, it is discovered the the glmnet stack produced the best predictions. This model produced an RMSE of 3085.436. Therefore, it was the model used to make the final predictions for the testing dataset.

5 Appendix

5.1 Exploratory Analysis

5.1.1 Loading Libraries

#Loading Libraries
library(dplyr)
library(ggplot2)
library(caret)
library(caretEnsemble)
library(VIM)
library(gridExtra)

5.1.2 Loading Data

#Loading Data
black_friday <- read.csv("train.csv")

5.1.3 Previewing Data

5.1.3.1 First Six Rows

head(black_friday)
##   User_ID Product_ID Gender   Age Occupation City_Category
## 1 1000001  P00069042      F  0-17         10             A
## 2 1000001  P00248942      F  0-17         10             A
## 3 1000001  P00087842      F  0-17         10             A
## 4 1000001  P00085442      F  0-17         10             A
## 5 1000002  P00285442      M   55+         16             C
## 6 1000003  P00193542      M 26-35         15             A
##   Stay_In_Current_City_Years Marital_Status Product_Category_1
## 1                          2              0                  3
## 2                          2              0                  1
## 3                          2              0                 12
## 4                          2              0                 12
## 5                         4+              0                  8
## 6                          3              0                  1
##   Product_Category_2 Product_Category_3 Purchase
## 1                 NA                 NA     8370
## 2                  6                 14    15200
## 3                 NA                 NA     1422
## 4                 14                 NA     1057
## 5                 NA                 NA     7969
## 6                  2                 NA    15227

5.1.3.2 Structure of Data

str(black_friday)
## 'data.frame':    550068 obs. of  12 variables:
##  $ User_ID                   : int  1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
##  $ Product_ID                : Factor w/ 3631 levels "P00000142","P00000242",..: 673 2377 853 829 2735 1832 1746 3321 3605 2632 ...
##  $ Gender                    : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
##  $ Age                       : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
##  $ Occupation                : int  10 10 10 10 16 15 7 7 7 20 ...
##  $ City_Category             : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
##  $ Stay_In_Current_City_Years: Factor w/ 5 levels "0","1","2","3",..: 3 3 3 3 5 4 3 3 3 2 ...
##  $ Marital_Status            : int  0 0 0 0 0 0 1 1 1 1 ...
##  $ Product_Category_1        : int  3 1 12 12 8 1 1 1 1 8 ...
##  $ Product_Category_2        : int  NA 6 NA 14 NA 2 8 15 16 NA ...
##  $ Product_Category_3        : int  NA 14 NA NA NA NA 17 NA NA NA ...
##  $ Purchase                  : int  8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...

5.1.3.3 Summary of Data

summary(black_friday)
##     User_ID            Product_ID     Gender        Age        
##  Min.   :1000001   P00265242:  1880   F:135809   0-17 : 15102  
##  1st Qu.:1001516   P00025442:  1615   M:414259   18-25: 99660  
##  Median :1003077   P00110742:  1612              26-35:219587  
##  Mean   :1003029   P00112142:  1562              36-45:110013  
##  3rd Qu.:1004478   P00057642:  1470              46-50: 45701  
##  Max.   :1006040   P00184942:  1440              51-55: 38501  
##                    (Other)  :540489              55+  : 21504  
##    Occupation     City_Category Stay_In_Current_City_Years
##  Min.   : 0.000   A:147720      0 : 74398                 
##  1st Qu.: 2.000   B:231173      1 :193821                 
##  Median : 7.000   C:171175      2 :101838                 
##  Mean   : 8.077                 3 : 95285                 
##  3rd Qu.:14.000                 4+: 84726                 
##  Max.   :20.000                                           
##                                                           
##  Marital_Status   Product_Category_1 Product_Category_2 Product_Category_3
##  Min.   :0.0000   Min.   : 1.000     Min.   : 2.00      Min.   : 3.0      
##  1st Qu.:0.0000   1st Qu.: 1.000     1st Qu.: 5.00      1st Qu.: 9.0      
##  Median :0.0000   Median : 5.000     Median : 9.00      Median :14.0      
##  Mean   :0.4097   Mean   : 5.404     Mean   : 9.84      Mean   :12.7      
##  3rd Qu.:1.0000   3rd Qu.: 8.000     3rd Qu.:15.00      3rd Qu.:16.0      
##  Max.   :1.0000   Max.   :20.000     Max.   :18.00      Max.   :18.0      
##                                      NA's   :173638     NA's   :383247    
##     Purchase    
##  Min.   :   12  
##  1st Qu.: 5823  
##  Median : 8047  
##  Mean   : 9264  
##  3rd Qu.:12054  
##  Max.   :23961  
## 

5.1.3.4 Changing Numeric Variables to Categorical Variables

#Converting Marital Status Column
black_friday$Marital_Status <- factor(black_friday$Marital_Status)

#Converting Occupation Column
black_friday$Occupation <- factor(black_friday$Occupation)

#Converting User_ID Column
black_friday$User_ID <- factor(black_friday$User_ID)

#Converting Product_Category_2 Column
black_friday$Product_Category_2 <- 
factor(black_friday$Product_Category_2)

#Converting Product_Category_3 Column
black_friday$Product_Category_3 <- factor(black_friday$Product_Category_3)

5.1.3.5 New Summary of Data

summary(black_friday)
##     User_ID           Product_ID     Gender        Age        
##  1001680:  1026   P00265242:  1880   F:135809   0-17 : 15102  
##  1004277:   979   P00025442:  1615   M:414259   18-25: 99660  
##  1001941:   898   P00110742:  1612              26-35:219587  
##  1001181:   862   P00112142:  1562              36-45:110013  
##  1000889:   823   P00057642:  1470              46-50: 45701  
##  1003618:   767   P00184942:  1440              51-55: 38501  
##  (Other):544713   (Other)  :540489              55+  : 21504  
##    Occupation     City_Category Stay_In_Current_City_Years Marital_Status
##  4      : 72308   A:147720      0 : 74398                  0:324731      
##  0      : 69638   B:231173      1 :193821                  1:225337      
##  7      : 59133   C:171175      2 :101838                                
##  1      : 47426                 3 : 95285                                
##  17     : 40043                 4+: 84726                                
##  20     : 33562                                                          
##  (Other):227958                                                          
##  Product_Category_1 Product_Category_2 Product_Category_3    Purchase    
##  Min.   : 1.000     8      : 64088     16     : 32636     Min.   :   12  
##  1st Qu.: 1.000     14     : 55108     15     : 28013     1st Qu.: 5823  
##  Median : 5.000     2      : 49217     14     : 18428     Median : 8047  
##  Mean   : 5.404     16     : 43255     17     : 16702     Mean   : 9264  
##  3rd Qu.: 8.000     15     : 37855     5      : 16658     3rd Qu.:12054  
##  Max.   :20.000     (Other):126907     (Other): 54384     Max.   :23961  
##                     NA's   :173638     NA's   :383247

5.1.3.6 Checking Spread of Product_Category_1

table(black_friday$Product_Category_1)
## 
##      1      2      3      4      5      6      7      8      9     10 
## 140378  23864  20213  11753 150933  20466   3721 113925    410   5125 
##     11     12     13     14     15     16     17     18     19     20 
##  24287   3947   5549   1523   6290   9828    578   3125   1603   2550

5.1.3.7 Checking Spread of Product_Category_2

table(black_friday$Product_Category_2)
## 
##     2     3     4     5     6     7     8     9    10    11    12    13 
## 49217  2884 25677 26235 16466   626 64088  5693  3043 14134  5528 10531 
##    14    15    16    17    18 
## 55108 37855 43255 13320  2770

5.1.3.8 Checking Spread of Product_Category_3

table(black_friday$Product_Category_3)
## 
##     3     4     5     6     8     9    10    11    12    13    14    15 
##   613  1875 16658  4890 12562 11579  1726  1805  9246  5459 18428 28013 
##    16    17    18 
## 32636 16702  4629

5.1.3.9 Imputing 0 For NA in Product_Category Columns

#Imputing 0 for NA in Product_Category_2
black_friday$Product_Category_2 <- as.numeric(black_friday$Product_Category_2)

black_friday[is.na(black_friday$Product_Category_2), "Product_Category_2"] <- 0

#Imputing 0 For NA in Product_Category_3
black_friday$Product_Category_3 <- as.numeric(black_friday$Product_Category_3)

black_friday[is.na(black_friday$Product_Category_3), "Product_Category_3"] <- 0

5.1.3.10 Standard Deviation Purchase

sd(black_friday$Purchase)
## [1] 5023.065

5.1.3.11 Final Summary of Data

summary(black_friday)
##     User_ID           Product_ID     Gender        Age        
##  1001680:  1026   P00265242:  1880   F:135809   0-17 : 15102  
##  1004277:   979   P00025442:  1615   M:414259   18-25: 99660  
##  1001941:   898   P00110742:  1612              26-35:219587  
##  1001181:   862   P00112142:  1562              36-45:110013  
##  1000889:   823   P00057642:  1470              46-50: 45701  
##  1003618:   767   P00184942:  1440              51-55: 38501  
##  (Other):544713   (Other)  :540489              55+  : 21504  
##    Occupation     City_Category Stay_In_Current_City_Years Marital_Status
##  4      : 72308   A:147720      0 : 74398                  0:324731      
##  0      : 69638   B:231173      1 :193821                  1:225337      
##  7      : 59133   C:171175      2 :101838                                
##  1      : 47426                 3 : 95285                                
##  17     : 40043                 4+: 84726                                
##  20     : 33562                                                          
##  (Other):227958                                                          
##  Product_Category_1 Product_Category_2 Product_Category_3    Purchase    
##  Min.   : 1.000     Min.   : 0.000     Min.   : 0.000     Min.   :   12  
##  1st Qu.: 1.000     1st Qu.: 0.000     1st Qu.: 0.000     1st Qu.: 5823  
##  Median : 5.000     Median : 4.000     Median : 0.000     Median : 8047  
##  Mean   : 5.404     Mean   : 6.051     Mean   : 2.976     Mean   : 9264  
##  3rd Qu.: 8.000     3rd Qu.:13.000     3rd Qu.: 5.000     3rd Qu.:12054  
##  Max.   :20.000     Max.   :17.000     Max.   :15.000     Max.   :23961  
## 

5.1.4 Visualizing Data

5.1.4.1 Histogram of Purchase Column

ggplot(black_friday, aes(x = Purchase)) +
  geom_histogram(bins = 75) +
  labs(title= "Purchases Histogram")

5.1.4.2 Histogram of Purchase Column by Gender

ggplot(black_friday, aes(x = Purchase, fill = Gender)) +
  geom_histogram(bins = 75) +
  facet_grid(. ~ Gender) +
  labs(title= "Purchases Histogram by Gender")

5.1.4.3 Histogram of Purchase Column by Age

ggplot(black_friday, aes(x = Purchase, fill = Age)) +
  geom_histogram(bins = 75) +
  facet_wrap(~ Age) +
  labs(title= "Purchases Histogram by Age") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

5.1.4.4 Histogram of Purchase Column by Occupation

ggplot(black_friday, aes(x = Purchase, fill = Occupation)) +
  geom_histogram(bins = 75) +
  facet_wrap(~ Occupation) +
  labs(title= "Purchases Histogram by Occupation") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

5.1.4.5 Histogram of Purchase Column by Marital Status

ggplot(black_friday, aes(x = Purchase, fill = Marital_Status)) +
  geom_histogram(bins = 75) +
  facet_wrap(~ Marital_Status) +
  labs(title= "Purchases Histogram by Marital Status") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

5.1.4.6 Histogram of Purchase Column by Stay in Current City

ggplot(black_friday, aes(x = Purchase, fill = Stay_In_Current_City_Years)) +
  geom_histogram(bins = 75) +
  facet_wrap(~ Stay_In_Current_City_Years) +
  labs(title= "Purchases Histogram by Stay In Current City Years") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

5.1.4.7 Histogram of Purchase Column by City Category

ggplot(black_friday, aes(x = Purchase, fill = City_Category)) +
  geom_histogram(bins = 75) +
  facet_wrap(~ City_Category) +
  labs(title= "Purchases Histogram by City Category") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

5.2 Machine Learning Models

5.2.1 Removing NZV Variables

black_friday_machine <- black_friday %>%
  select(-User_ID, -Product_ID)

summary(black_friday_machine)
##  Gender        Age           Occupation     City_Category
##  F:135809   0-17 : 15102   4      : 72308   A:147720     
##  M:414259   18-25: 99660   0      : 69638   B:231173     
##             26-35:219587   7      : 59133   C:171175     
##             36-45:110013   1      : 47426                
##             46-50: 45701   17     : 40043                
##             51-55: 38501   20     : 33562                
##             55+  : 21504   (Other):227958                
##  Stay_In_Current_City_Years Marital_Status Product_Category_1
##  0 : 74398                  0:324731       Min.   : 1.000    
##  1 :193821                  1:225337       1st Qu.: 1.000    
##  2 :101838                                 Median : 5.000    
##  3 : 95285                                 Mean   : 5.404    
##  4+: 84726                                 3rd Qu.: 8.000    
##                                            Max.   :20.000    
##                                                              
##  Product_Category_2 Product_Category_3    Purchase    
##  Min.   : 0.000     Min.   : 0.000     Min.   :   12  
##  1st Qu.: 0.000     1st Qu.: 0.000     1st Qu.: 5823  
##  Median : 4.000     Median : 0.000     Median : 8047  
##  Mean   : 6.051     Mean   : 2.976     Mean   : 9264  
##  3rd Qu.:13.000     3rd Qu.: 5.000     3rd Qu.:12054  
##  Max.   :17.000     Max.   :15.000     Max.   :23961  
## 

5.2.2 Sampling Data

set.seed(366284)
black_friday_sample <- createDataPartition(y = black_friday_machine$Purchase, 
                               p = 0.1, list=FALSE)
black_friday_sample <- black_friday_machine[black_friday_sample, ]

5.2.2.1 Summary Sample

summary(black_friday_sample)
##  Gender       Age          Occupation    City_Category
##  F:13639   0-17 : 1540   4      : 7235   A:14752      
##  M:41369   18-25: 9945   0      : 7054   B:23110      
##            26-35:22135   7      : 5858   C:17146      
##            36-45:10938   1      : 4721                
##            46-50: 4572   17     : 3935                
##            51-55: 3825   20     : 3383                
##            55+  : 2053   (Other):22822                
##  Stay_In_Current_City_Years Marital_Status Product_Category_1
##  0 : 7324                   0:32427        Min.   : 1.000    
##  1 :19535                   1:22581        1st Qu.: 1.000    
##  2 :10051                                  Median : 5.000    
##  3 : 9623                                  Mean   : 5.406    
##  4+: 8475                                  3rd Qu.: 8.000    
##                                            Max.   :20.000    
##                                                              
##  Product_Category_2 Product_Category_3    Purchase    
##  Min.   : 0.000     Min.   : 0.000     Min.   :   12  
##  1st Qu.: 0.000     1st Qu.: 0.000     1st Qu.: 5823  
##  Median : 4.000     Median : 0.000     Median : 8047  
##  Mean   : 6.054     Mean   : 2.987     Mean   : 9265  
##  3rd Qu.:13.000     3rd Qu.: 5.000     3rd Qu.:12054  
##  Max.   :17.000     Max.   :15.000     Max.   :23961  
## 

5.2.3 Partitioning Data

inTrain <- createDataPartition(y = black_friday_sample$Purchase, 
                               p = 0.7, list=FALSE)
train <- black_friday_sample[inTrain, ]
test <- black_friday_sample[-inTrain, ]

5.2.4 Caret List

#Setting Control Parameters
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE, classProbs = TRUE)

#Creating List with Model Types to be Used
algorithmList <- c('glm', 'glmnet', 'lm', 'treebag', 'gbm')

#Building Models
models <- caretList(Purchase ~ ., train, trControl = control, methodList = algorithmList)

#Testing Models Predictive Accuracy
results <- resamples(models)
summary(results)

5.2.5 Building Ensembles

5.2.5.1 GLMNET Ensemble

stack_glmnet <- caretStack(models, method = "glmnet", trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE))
stack_glmnet
## A glmnet ensemble of 2 base models: glm, glmnet, lm, treebag, gbm
## 
## Ensemble results:
## glmnet 
## 
## 115521 samples
##      5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 103970, 103967, 103970, 103969, 103970, 103967, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      RMSE      Rsquared   MAE     
##   0.10     8.027276  3001.843  0.6414321  2273.112
##   0.10    80.272756  3008.783  0.6398543  2290.291
##   0.10   802.727561  3044.900  0.6357639  2334.577
##   0.55     8.027276  3001.943  0.6414161  2273.854
##   0.55    80.272756  3006.371  0.6405215  2285.338
##   0.55   802.727561  3073.138  0.6372864  2352.663
##   1.00     8.027276  3001.971  0.6414065  2273.848
##   1.00    80.272756  3004.336  0.6410899  2279.704
##   1.00   802.727561  3108.705  0.6410899  2372.204
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 8.027276.
5.2.5.1.1 Testing Model
5.2.5.1.1.1 Getting Predictions
predictions_glmnet <- predict(stack_glmnet, test)
error <- predictions_glmnet - test$Purchase
5.2.5.1.1.2 Calculating RMSE
sqrt(mean(error^2))
## [1] 2993.751

5.2.5.2 Linear Regression Ensemble

stack_lm <- caretStack(models, method = "lm", trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE))
stack_lm
## A lm ensemble of 2 base models: glm, glmnet, lm, treebag, gbm
## 
## Ensemble results:
## Linear Regression 
## 
## 115521 samples
##      5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 103969, 103970, 103968, 103970, 103969, 103969, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE    
##   3001.28  0.6415603  2270.54
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
5.2.5.2.1 Testing Model
5.2.5.2.1.1 Getting Predictions
predictions_lm <- predict(stack_lm, test)
error <- predictions_lm - test$Purchase
5.2.5.2.1.2 Calculating RMSE
sqrt(mean(error^2))
## [1] 2993.222

5.2.5.3 GBM Ensemble

stack_gbm <- caretStack(models, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE))
stack_gbm
## A gbm ensemble of 2 base models: glm, glmnet, lm, treebag, gbm
## 
## Ensemble results:
## Stochastic Gradient Boosting 
## 
## 115521 samples
##      5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 103970, 103970, 103968, 103968, 103968, 103968, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   1                   50      3023.797  0.6406659  2318.868
##   1                  100      2992.495  0.6437677  2261.503
##   1                  150      2991.331  0.6439519  2258.035
##   2                   50      2993.703  0.6437522  2267.328
##   2                  100      2988.230  0.6446978  2256.138
##   2                  150      2986.675  0.6450584  2253.924
##   3                   50      2989.975  0.6443964  2260.389
##   3                  100      2986.099  0.6451977  2253.835
##   3                  150      2984.273  0.6456250  2251.175
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

5.3 Testing Model

5.3.0.0.0.1 Getting Predictions
predictions_gbm <- predict(stack_gbm, test)
error <- predictions_gbm - test$Purchase
5.3.0.0.0.2 Calculating RMSE
sqrt(mean(error^2))
## [1] 2980.546

5.4 Testing Model

5.4.1 Importing Testing Data

testing <- read.csv("test.csv")

5.4.2 Converting Data

#Converting Marital Status Column
testing$Marital_Status <- factor(testing$Marital_Status)

#Converting Occupation Column
testing$Occupation <- factor(testing$Occupation)

#Converting User_ID Column
testing$User_ID <- factor(testing$User_ID)

#Imputing 0 for NA in Product_Category_2
testing$Product_Category_2 <- as.numeric(testing$Product_Category_2)

testing[is.na(testing$Product_Category_2), "Product_Category_2"] <- 0

#Imputing 0 For NA in Product_Category_3
testing$Product_Category_3 <- as.numeric(testing$Product_Category_3)

testing[is.na(testing$Product_Category_3), "Product_Category_3"] <- 0

#Removing Near Zero Variance Variables
testing_sub <- testing %>%
  select(-User_ID, -Product_ID)

summary(testing_sub)
##  Gender        Age          Occupation    City_Category
##  F: 57827   0-17 : 6232   4      :30778   A:62524      
##  M:175772   18-25:42293   0      :29212   B:98566      
##             26-35:93428   7      :24994   C:72509      
##             36-45:46711   1      :20261                
##             46-50:19577   17     :17375                
##             51-55:16283   20     :14278                
##             55+  : 9075   (Other):96701                
##  Stay_In_Current_City_Years Marital_Status Product_Category_1
##  0 :31318                   0:137807       Min.   : 1.000    
##  1 :82604                   1: 95792       1st Qu.: 1.000    
##  2 :43589                                  Median : 5.000    
##  3 :40143                                  Mean   : 5.277    
##  4+:35945                                  3rd Qu.: 8.000    
##                                            Max.   :18.000    
##                                                              
##  Product_Category_2 Product_Category_3
##  Min.   : 0.000     Min.   : 0.000    
##  1st Qu.: 0.000     1st Qu.: 0.000    
##  Median : 5.000     Median : 0.000    
##  Mean   : 6.799     Mean   : 3.853    
##  3rd Qu.:14.000     3rd Qu.: 8.000    
##  Max.   :18.000     Max.   :18.000    
## 

5.4.3 Final Test

testing_predictions_glmnet <- predict(stack_glmnet, testing_sub)

testing$Purchase <- testing_predictions_glmnet

submission_glmnet <- testing[, c("User_ID", "Product_ID", "Purchase")]

dim(submission_glmnet)
## [1] 233599      3
head(submission_glmnet)
##   User_ID Product_ID  Purchase
## 1 1000004  P00128942 13191.417
## 2 1000009  P00113442 10445.324
## 3 1000010  P00288442  5991.540
## 4 1000010  P00145342  2909.020
## 5 1000011  P00053842  2696.515
## 6 1000013  P00350442 12141.469
write.csv(submission_glmnet, "black_friday_predictions.csv", 
          row.names = FALSE)