ABC Private Limited is a retail company that would like to understand the purchasing habits of their customers so that they can offer a personalized list of products that would interest those customers. This report will explore a month’s worth of sales data from ABC Private Limited. The variable of interest in this report is the amount purchased in the last month. This data also contains demographic information on customers that shopped at ABC Private Limited in the previous month. This demographic information includes age, gender, occupation, city category, stay in current city and marital status. It also contains information on the products purchased, such as their ID and different product category information. Using this information, a machine learning model will be built that will predict the purchase amount based on the customer’s demographics and the categories of the products. In the end, ABC Private Unlimited will have information that they can use to offer customers products that will be appropriate for them.
The first step in exploring the data is loading the libraries that will be used for analysis. Next the data was stored in the variable black_friday. Inspecting the data reveals that every variable other than Purchase is a categorical variable. However, there are some columns that are stored as numerical data. The columns that needed changing were Marital_Status, Occupation, User_ID, Product_Category_1, Product_Category_2 and Product_Category_3. So, these columns were changed to factor variables.
Another problem with the data is that Product_Category_2 and Product_Category_3 have missing values. Since these columns contain categorical data, it would not be appropriate to use median or kNN imputation. However, a value is necessary where one is missing for the machine learning phase of this report. Therefore the value 0 is imputed for the missing data.
Next, the fully clean dataset is examined. It reveals that the median purchase amount for last month was $8,047. In addition, it shows that the mean was $9,264. The minimum purchase amount was $12 and the maximum purchase amount was $23,961. A histogram of the purchase variable shows a unimodal curve that has a positive skew. This positive skew explains why the mean of the purchase amount variable is larger than the median of purchase amount variable.
Further exploration reveals that the when viewing histograms of the purchase amount for the previous month broken down using other variables, the median does not move much. There are times when the histogram will have a larger count. However, in the case of gender, that is to be expected because there were more male customers. There were 135,809 female customers, but there were 414,259 male customers. The same is true when looking at purchase amount broken down by age. The 26-35 year old group made the most purchases, which explains why the count was greater in the Purchases Histogram by Age graph. Interestingly, all histograms reveal a similar median and a similar positive skew. However, people who shop most at ABC Private Limited are males aged 18-45. Additionally they are usually of the marital status 0. Unfortunately, only those at ABC Private Limited know what that means. Additionally, shoppers at ABC Private Limited usually are new to their city and have the City Category B.
This report uses machine learning in order to build a prediction model that will predict a customer’s purchase amount. The first step to building this model is removing the User_ID and Product_ID columns. These columns were removed because they are variables that have zero variance. Each Product_ID and each User_ID will be particular to the customer or product.
After these near zero variance variables were removed, a sample was selected from the data. This sample was selected because this dataset is large. Its size may cause trouble when building a machine learning model. Fortunately, a large sample of the data should be enough to represent the data accurately and to build a machine learning model. The sample that was selected was selected randomly. The sample size 10% of the data. Once the sample is selected, the next step in building the model can be performed.
Next the sampled data was partitioned. 70% of the sample was selected randomly to be the train set. The train set is the data that will be used to build the algorithm. The other 30% of the sample was assigned to the test set. The test set will be used to test the accuracy of the model.
Once the sample is partitioned, a list of machine learning algorithms using different methods is compiled. These algorithms were built using ten-fold repeated cross-validation with three repeats. The models created include a glm model, a glmnet model, a linear regression model, a GBM model and a treebag model. The model that produced the best median RMSE was the gbm model. This model produced a median RMSE of 3002.142. Using this model would produce fairly accurate predictions. However, creating an ensemble model using these models should produce even greater accuracy.
Next, three ensemble models were created. These ensemble models included a linear regression model, a glmnet model and a gbm model. Each model also used ten-fold cross-validation repeated three times. All three of these models proved to be better predictors than any of the other models alone. The gbm model produced an RMSE of 2980.54. The linear model produced an RMSE of 2993.22 and the glmnet model produced an RMSE of 2993.751. Each of these models will be tested on the testing data set to see how well each model predicts the purchase variable.
There are a few conclusions that can be made using the analysis in this paper. The first conclusion is that even when broken down into different demographics, the median purchase made by customers does not fluctuate much. It didn’t matter if the group was male, female, young, old, married or unmarried, the median purchase by the customers hovered around $8000. However, some groups were more present than others. Males shopped more than females. The marital status 0 shopped more than the marital status 1. Unfortunately, which label mean married and which label means unmarried is unknown. Also customers between the ages of 18 and 45 shopped the most. The age range 26-35 had the highest turnout. Additionally, people who only lived in their city for a year shopped a lot.
So there are two different ways that ABC Private Limited could increase their sales. One way would be for them to advertise to groups that do not shop as often as the other groups. The thinking here is that no matter what, median purchase will not change much. However, these unrepresented groups could be represented more and increase purchases. The other option would be for ABC Private Limited to target the demographics that they know show up most often. The thinking here is that these customers produce the largest part of the customer base. Therefore they would be the people most likely to shop in the store.
Finally, the models were tested to find the model that makes the best predictions. When analyzing the data from the testing model, it is revealed that the Product_Category_1, Product_Category_2 and Product_Category_3 variables have new levels. These new variables will present a problem when making predictions. Therefore, the original models needed to be revisited and these variables were left as numeric variables after 0 was imputed for the missing values. Once the different models were used to make predictions, it is discovered the the glmnet stack produced the best predictions. This model produced an RMSE of 3085.436. Therefore, it was the model used to make the final predictions for the testing dataset.
#Loading Libraries
library(dplyr)
library(ggplot2)
library(caret)
library(caretEnsemble)
library(VIM)
library(gridExtra)
#Loading Data
black_friday <- read.csv("train.csv")
head(black_friday)
## User_ID Product_ID Gender Age Occupation City_Category
## 1 1000001 P00069042 F 0-17 10 A
## 2 1000001 P00248942 F 0-17 10 A
## 3 1000001 P00087842 F 0-17 10 A
## 4 1000001 P00085442 F 0-17 10 A
## 5 1000002 P00285442 M 55+ 16 C
## 6 1000003 P00193542 M 26-35 15 A
## Stay_In_Current_City_Years Marital_Status Product_Category_1
## 1 2 0 3
## 2 2 0 1
## 3 2 0 12
## 4 2 0 12
## 5 4+ 0 8
## 6 3 0 1
## Product_Category_2 Product_Category_3 Purchase
## 1 NA NA 8370
## 2 6 14 15200
## 3 NA NA 1422
## 4 14 NA 1057
## 5 NA NA 7969
## 6 2 NA 15227
str(black_friday)
## 'data.frame': 550068 obs. of 12 variables:
## $ User_ID : int 1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
## $ Product_ID : Factor w/ 3631 levels "P00000142","P00000242",..: 673 2377 853 829 2735 1832 1746 3321 3605 2632 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
## $ Age : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
## $ Occupation : int 10 10 10 10 16 15 7 7 7 20 ...
## $ City_Category : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
## $ Stay_In_Current_City_Years: Factor w/ 5 levels "0","1","2","3",..: 3 3 3 3 5 4 3 3 3 2 ...
## $ Marital_Status : int 0 0 0 0 0 0 1 1 1 1 ...
## $ Product_Category_1 : int 3 1 12 12 8 1 1 1 1 8 ...
## $ Product_Category_2 : int NA 6 NA 14 NA 2 8 15 16 NA ...
## $ Product_Category_3 : int NA 14 NA NA NA NA 17 NA NA NA ...
## $ Purchase : int 8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...
summary(black_friday)
## User_ID Product_ID Gender Age
## Min. :1000001 P00265242: 1880 F:135809 0-17 : 15102
## 1st Qu.:1001516 P00025442: 1615 M:414259 18-25: 99660
## Median :1003077 P00110742: 1612 26-35:219587
## Mean :1003029 P00112142: 1562 36-45:110013
## 3rd Qu.:1004478 P00057642: 1470 46-50: 45701
## Max. :1006040 P00184942: 1440 51-55: 38501
## (Other) :540489 55+ : 21504
## Occupation City_Category Stay_In_Current_City_Years
## Min. : 0.000 A:147720 0 : 74398
## 1st Qu.: 2.000 B:231173 1 :193821
## Median : 7.000 C:171175 2 :101838
## Mean : 8.077 3 : 95285
## 3rd Qu.:14.000 4+: 84726
## Max. :20.000
##
## Marital_Status Product_Category_1 Product_Category_2 Product_Category_3
## Min. :0.0000 Min. : 1.000 Min. : 2.00 Min. : 3.0
## 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 9.0
## Median :0.0000 Median : 5.000 Median : 9.00 Median :14.0
## Mean :0.4097 Mean : 5.404 Mean : 9.84 Mean :12.7
## 3rd Qu.:1.0000 3rd Qu.: 8.000 3rd Qu.:15.00 3rd Qu.:16.0
## Max. :1.0000 Max. :20.000 Max. :18.00 Max. :18.0
## NA's :173638 NA's :383247
## Purchase
## Min. : 12
## 1st Qu.: 5823
## Median : 8047
## Mean : 9264
## 3rd Qu.:12054
## Max. :23961
##
#Converting Marital Status Column
black_friday$Marital_Status <- factor(black_friday$Marital_Status)
#Converting Occupation Column
black_friday$Occupation <- factor(black_friday$Occupation)
#Converting User_ID Column
black_friday$User_ID <- factor(black_friday$User_ID)
#Converting Product_Category_2 Column
black_friday$Product_Category_2 <-
factor(black_friday$Product_Category_2)
#Converting Product_Category_3 Column
black_friday$Product_Category_3 <- factor(black_friday$Product_Category_3)
summary(black_friday)
## User_ID Product_ID Gender Age
## 1001680: 1026 P00265242: 1880 F:135809 0-17 : 15102
## 1004277: 979 P00025442: 1615 M:414259 18-25: 99660
## 1001941: 898 P00110742: 1612 26-35:219587
## 1001181: 862 P00112142: 1562 36-45:110013
## 1000889: 823 P00057642: 1470 46-50: 45701
## 1003618: 767 P00184942: 1440 51-55: 38501
## (Other):544713 (Other) :540489 55+ : 21504
## Occupation City_Category Stay_In_Current_City_Years Marital_Status
## 4 : 72308 A:147720 0 : 74398 0:324731
## 0 : 69638 B:231173 1 :193821 1:225337
## 7 : 59133 C:171175 2 :101838
## 1 : 47426 3 : 95285
## 17 : 40043 4+: 84726
## 20 : 33562
## (Other):227958
## Product_Category_1 Product_Category_2 Product_Category_3 Purchase
## Min. : 1.000 8 : 64088 16 : 32636 Min. : 12
## 1st Qu.: 1.000 14 : 55108 15 : 28013 1st Qu.: 5823
## Median : 5.000 2 : 49217 14 : 18428 Median : 8047
## Mean : 5.404 16 : 43255 17 : 16702 Mean : 9264
## 3rd Qu.: 8.000 15 : 37855 5 : 16658 3rd Qu.:12054
## Max. :20.000 (Other):126907 (Other): 54384 Max. :23961
## NA's :173638 NA's :383247
table(black_friday$Product_Category_1)
##
## 1 2 3 4 5 6 7 8 9 10
## 140378 23864 20213 11753 150933 20466 3721 113925 410 5125
## 11 12 13 14 15 16 17 18 19 20
## 24287 3947 5549 1523 6290 9828 578 3125 1603 2550
table(black_friday$Product_Category_2)
##
## 2 3 4 5 6 7 8 9 10 11 12 13
## 49217 2884 25677 26235 16466 626 64088 5693 3043 14134 5528 10531
## 14 15 16 17 18
## 55108 37855 43255 13320 2770
table(black_friday$Product_Category_3)
##
## 3 4 5 6 8 9 10 11 12 13 14 15
## 613 1875 16658 4890 12562 11579 1726 1805 9246 5459 18428 28013
## 16 17 18
## 32636 16702 4629
#Imputing 0 for NA in Product_Category_2
black_friday$Product_Category_2 <- as.numeric(black_friday$Product_Category_2)
black_friday[is.na(black_friday$Product_Category_2), "Product_Category_2"] <- 0
#Imputing 0 For NA in Product_Category_3
black_friday$Product_Category_3 <- as.numeric(black_friday$Product_Category_3)
black_friday[is.na(black_friday$Product_Category_3), "Product_Category_3"] <- 0
sd(black_friday$Purchase)
## [1] 5023.065
summary(black_friday)
## User_ID Product_ID Gender Age
## 1001680: 1026 P00265242: 1880 F:135809 0-17 : 15102
## 1004277: 979 P00025442: 1615 M:414259 18-25: 99660
## 1001941: 898 P00110742: 1612 26-35:219587
## 1001181: 862 P00112142: 1562 36-45:110013
## 1000889: 823 P00057642: 1470 46-50: 45701
## 1003618: 767 P00184942: 1440 51-55: 38501
## (Other):544713 (Other) :540489 55+ : 21504
## Occupation City_Category Stay_In_Current_City_Years Marital_Status
## 4 : 72308 A:147720 0 : 74398 0:324731
## 0 : 69638 B:231173 1 :193821 1:225337
## 7 : 59133 C:171175 2 :101838
## 1 : 47426 3 : 95285
## 17 : 40043 4+: 84726
## 20 : 33562
## (Other):227958
## Product_Category_1 Product_Category_2 Product_Category_3 Purchase
## Min. : 1.000 Min. : 0.000 Min. : 0.000 Min. : 12
## 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 5823
## Median : 5.000 Median : 4.000 Median : 0.000 Median : 8047
## Mean : 5.404 Mean : 6.051 Mean : 2.976 Mean : 9264
## 3rd Qu.: 8.000 3rd Qu.:13.000 3rd Qu.: 5.000 3rd Qu.:12054
## Max. :20.000 Max. :17.000 Max. :15.000 Max. :23961
##
ggplot(black_friday, aes(x = Purchase)) +
geom_histogram(bins = 75) +
labs(title= "Purchases Histogram")
ggplot(black_friday, aes(x = Purchase, fill = Gender)) +
geom_histogram(bins = 75) +
facet_grid(. ~ Gender) +
labs(title= "Purchases Histogram by Gender")
ggplot(black_friday, aes(x = Purchase, fill = Age)) +
geom_histogram(bins = 75) +
facet_wrap(~ Age) +
labs(title= "Purchases Histogram by Age") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(black_friday, aes(x = Purchase, fill = Occupation)) +
geom_histogram(bins = 75) +
facet_wrap(~ Occupation) +
labs(title= "Purchases Histogram by Occupation") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(black_friday, aes(x = Purchase, fill = Marital_Status)) +
geom_histogram(bins = 75) +
facet_wrap(~ Marital_Status) +
labs(title= "Purchases Histogram by Marital Status") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(black_friday, aes(x = Purchase, fill = Stay_In_Current_City_Years)) +
geom_histogram(bins = 75) +
facet_wrap(~ Stay_In_Current_City_Years) +
labs(title= "Purchases Histogram by Stay In Current City Years") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(black_friday, aes(x = Purchase, fill = City_Category)) +
geom_histogram(bins = 75) +
facet_wrap(~ City_Category) +
labs(title= "Purchases Histogram by City Category") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
black_friday_machine <- black_friday %>%
select(-User_ID, -Product_ID)
summary(black_friday_machine)
## Gender Age Occupation City_Category
## F:135809 0-17 : 15102 4 : 72308 A:147720
## M:414259 18-25: 99660 0 : 69638 B:231173
## 26-35:219587 7 : 59133 C:171175
## 36-45:110013 1 : 47426
## 46-50: 45701 17 : 40043
## 51-55: 38501 20 : 33562
## 55+ : 21504 (Other):227958
## Stay_In_Current_City_Years Marital_Status Product_Category_1
## 0 : 74398 0:324731 Min. : 1.000
## 1 :193821 1:225337 1st Qu.: 1.000
## 2 :101838 Median : 5.000
## 3 : 95285 Mean : 5.404
## 4+: 84726 3rd Qu.: 8.000
## Max. :20.000
##
## Product_Category_2 Product_Category_3 Purchase
## Min. : 0.000 Min. : 0.000 Min. : 12
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 5823
## Median : 4.000 Median : 0.000 Median : 8047
## Mean : 6.051 Mean : 2.976 Mean : 9264
## 3rd Qu.:13.000 3rd Qu.: 5.000 3rd Qu.:12054
## Max. :17.000 Max. :15.000 Max. :23961
##
set.seed(366284)
black_friday_sample <- createDataPartition(y = black_friday_machine$Purchase,
p = 0.1, list=FALSE)
black_friday_sample <- black_friday_machine[black_friday_sample, ]
summary(black_friday_sample)
## Gender Age Occupation City_Category
## F:13639 0-17 : 1540 4 : 7235 A:14752
## M:41369 18-25: 9945 0 : 7054 B:23110
## 26-35:22135 7 : 5858 C:17146
## 36-45:10938 1 : 4721
## 46-50: 4572 17 : 3935
## 51-55: 3825 20 : 3383
## 55+ : 2053 (Other):22822
## Stay_In_Current_City_Years Marital_Status Product_Category_1
## 0 : 7324 0:32427 Min. : 1.000
## 1 :19535 1:22581 1st Qu.: 1.000
## 2 :10051 Median : 5.000
## 3 : 9623 Mean : 5.406
## 4+: 8475 3rd Qu.: 8.000
## Max. :20.000
##
## Product_Category_2 Product_Category_3 Purchase
## Min. : 0.000 Min. : 0.000 Min. : 12
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 5823
## Median : 4.000 Median : 0.000 Median : 8047
## Mean : 6.054 Mean : 2.987 Mean : 9265
## 3rd Qu.:13.000 3rd Qu.: 5.000 3rd Qu.:12054
## Max. :17.000 Max. :15.000 Max. :23961
##
inTrain <- createDataPartition(y = black_friday_sample$Purchase,
p = 0.7, list=FALSE)
train <- black_friday_sample[inTrain, ]
test <- black_friday_sample[-inTrain, ]
#Setting Control Parameters
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE, classProbs = TRUE)
#Creating List with Model Types to be Used
algorithmList <- c('glm', 'glmnet', 'lm', 'treebag', 'gbm')
#Building Models
models <- caretList(Purchase ~ ., train, trControl = control, methodList = algorithmList)
#Testing Models Predictive Accuracy
results <- resamples(models)
summary(results)
stack_glmnet <- caretStack(models, method = "glmnet", trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE))
stack_glmnet
## A glmnet ensemble of 2 base models: glm, glmnet, lm, treebag, gbm
##
## Ensemble results:
## glmnet
##
## 115521 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 103970, 103967, 103970, 103969, 103970, 103967, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 8.027276 3001.843 0.6414321 2273.112
## 0.10 80.272756 3008.783 0.6398543 2290.291
## 0.10 802.727561 3044.900 0.6357639 2334.577
## 0.55 8.027276 3001.943 0.6414161 2273.854
## 0.55 80.272756 3006.371 0.6405215 2285.338
## 0.55 802.727561 3073.138 0.6372864 2352.663
## 1.00 8.027276 3001.971 0.6414065 2273.848
## 1.00 80.272756 3004.336 0.6410899 2279.704
## 1.00 802.727561 3108.705 0.6410899 2372.204
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 8.027276.
predictions_glmnet <- predict(stack_glmnet, test)
error <- predictions_glmnet - test$Purchase
sqrt(mean(error^2))
## [1] 2993.751
stack_lm <- caretStack(models, method = "lm", trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE))
stack_lm
## A lm ensemble of 2 base models: glm, glmnet, lm, treebag, gbm
##
## Ensemble results:
## Linear Regression
##
## 115521 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 103969, 103970, 103968, 103970, 103969, 103969, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3001.28 0.6415603 2270.54
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
predictions_lm <- predict(stack_lm, test)
error <- predictions_lm - test$Purchase
sqrt(mean(error^2))
## [1] 2993.222
stack_gbm <- caretStack(models, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE))
stack_gbm
## A gbm ensemble of 2 base models: glm, glmnet, lm, treebag, gbm
##
## Ensemble results:
## Stochastic Gradient Boosting
##
## 115521 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 103970, 103970, 103968, 103968, 103968, 103968, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 3023.797 0.6406659 2318.868
## 1 100 2992.495 0.6437677 2261.503
## 1 150 2991.331 0.6439519 2258.035
## 2 50 2993.703 0.6437522 2267.328
## 2 100 2988.230 0.6446978 2256.138
## 2 150 2986.675 0.6450584 2253.924
## 3 50 2989.975 0.6443964 2260.389
## 3 100 2986.099 0.6451977 2253.835
## 3 150 2984.273 0.6456250 2251.175
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
predictions_gbm <- predict(stack_gbm, test)
error <- predictions_gbm - test$Purchase
sqrt(mean(error^2))
## [1] 2980.546
testing <- read.csv("test.csv")
#Converting Marital Status Column
testing$Marital_Status <- factor(testing$Marital_Status)
#Converting Occupation Column
testing$Occupation <- factor(testing$Occupation)
#Converting User_ID Column
testing$User_ID <- factor(testing$User_ID)
#Imputing 0 for NA in Product_Category_2
testing$Product_Category_2 <- as.numeric(testing$Product_Category_2)
testing[is.na(testing$Product_Category_2), "Product_Category_2"] <- 0
#Imputing 0 For NA in Product_Category_3
testing$Product_Category_3 <- as.numeric(testing$Product_Category_3)
testing[is.na(testing$Product_Category_3), "Product_Category_3"] <- 0
#Removing Near Zero Variance Variables
testing_sub <- testing %>%
select(-User_ID, -Product_ID)
summary(testing_sub)
## Gender Age Occupation City_Category
## F: 57827 0-17 : 6232 4 :30778 A:62524
## M:175772 18-25:42293 0 :29212 B:98566
## 26-35:93428 7 :24994 C:72509
## 36-45:46711 1 :20261
## 46-50:19577 17 :17375
## 51-55:16283 20 :14278
## 55+ : 9075 (Other):96701
## Stay_In_Current_City_Years Marital_Status Product_Category_1
## 0 :31318 0:137807 Min. : 1.000
## 1 :82604 1: 95792 1st Qu.: 1.000
## 2 :43589 Median : 5.000
## 3 :40143 Mean : 5.277
## 4+:35945 3rd Qu.: 8.000
## Max. :18.000
##
## Product_Category_2 Product_Category_3
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 5.000 Median : 0.000
## Mean : 6.799 Mean : 3.853
## 3rd Qu.:14.000 3rd Qu.: 8.000
## Max. :18.000 Max. :18.000
##
testing_predictions_glmnet <- predict(stack_glmnet, testing_sub)
testing$Purchase <- testing_predictions_glmnet
submission_glmnet <- testing[, c("User_ID", "Product_ID", "Purchase")]
dim(submission_glmnet)
## [1] 233599 3
head(submission_glmnet)
## User_ID Product_ID Purchase
## 1 1000004 P00128942 13191.417
## 2 1000009 P00113442 10445.324
## 3 1000010 P00288442 5991.540
## 4 1000010 P00145342 2909.020
## 5 1000011 P00053842 2696.515
## 6 1000013 P00350442 12141.469
write.csv(submission_glmnet, "black_friday_predictions.csv",
row.names = FALSE)