Black Friday is an informal name for the day following Thanksgiving Day in the United States, the fourth Thursday of November, which has been regarded as the beginning of the country’s Christmas shopping season since 1952.
Most major retailers open very early, as early as overnight hours, and offer promotional sales. Black Friday is not an official holiday, but California and some other states observe “The Day After Thanksgiving” as a holiday for state government employees, sometimes in lieu of another federal holiday such as Columbus Day.[2] Many non-retail employees and schools have both Thanksgiving and the following Friday off, which, along with the following regular weekend, makes it a four-day weekend, thereby increasing the number of potential shoppers.
A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) against various factors. They have shared a purchase summary of various customers for selected high-volume products from last month.
The data set contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and Total purchase_amount from last month.
Now, they want to build a model to predict the purchase amount of a customer.
setwd("G:/Data Science Projects/Black Friday Regression sales")
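# stringsAsFactors = TRUE keeps text columns as factors, as in the str() output below
# (this stopped being the default in R 4.0)
sale.df <- read.csv("train.csv",sep = ",",stringsAsFactors = TRUE)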
View(sale.df)
dim(sale.df)
## [1] 550068 12
The dataset consists of 550068 rows and 12 columns.
str(sale.df)
## 'data.frame': 550068 obs. of 12 variables:
## $ User_ID : int 1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
## $ Product_ID : Factor w/ 3631 levels "P00000142","P00000242",..: 673 2377 853 829 2735 1832 1746 3321 3605 2632 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
## $ Age : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
## $ Occupation : int 10 10 10 10 16 15 7 7 7 20 ...
## $ City_Category : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
## $ Stay_In_Current_City_Years: Factor w/ 5 levels "0","1","2","3",..: 3 3 3 3 5 4 3 3 3 2 ...
## $ Marital_Status : int 0 0 0 0 0 0 1 1 1 1 ...
## $ Product_Category_1 : int 3 1 12 12 8 1 1 1 1 8 ...
## $ Product_Category_2 : int NA 6 NA 14 NA 2 8 15 16 NA ...
## $ Product_Category_3 : int NA 14 NA NA NA NA 17 NA NA NA ...
## $ Purchase : int 8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...
library(psych)
describe(sale.df)
## vars n mean sd median
## User_ID 1 550068 1003028.84 1727.59 1003077
## Product_ID* 2 550068 1708.47 1012.20 1667
## Gender* 3 550068 1.75 0.43 2
## Age* 4 550068 3.50 1.35 3
## Occupation 5 550068 8.08 6.52 7
## City_Category* 6 550068 2.04 0.76 2
## Stay_In_Current_City_Years* 7 550068 2.86 1.29 3
## Marital_Status 8 550068 0.41 0.49 0
## Product_Category_1 9 550068 5.40 3.94 5
## Product_Category_2 10 376430 9.84 5.09 9
## Product_Category_3 11 166821 12.67 4.13 14
## Purchase 12 550068 9263.97 5023.07 8047
## trimmed mad min max range skew
## User_ID 1003026.91 2176.46 1000001 1006040 6039 0.00
## Product_ID* 1688.69 1197.94 1 3631 3630 0.15
## Gender* 1.82 0.00 1 2 1 -1.17
## Age* 3.36 1.48 1 7 6 0.81
## Occupation 7.69 8.90 0 20 20 0.40
## City_Category* 2.05 1.48 1 3 2 -0.07
## Stay_In_Current_City_Years* 2.82 1.48 1 5 4 0.32
## Marital_Status 0.39 0.00 0 1 1 0.37
## Product_Category_1 4.90 4.45 1 20 19 1.03
## Product_Category_2 9.99 7.41 2 18 16 -0.16
## Product_Category_3 13.07 2.97 3 18 15 -0.77
## Purchase 8929.12 4256.54 12 23961 23949 0.60
## kurtosis se
## User_ID -1.20 2.33
## Product_ID* -1.09 1.36
## Gender* -0.62 0.00
## Age* 0.30 0.00
## Occupation -1.22 0.01
## City_Category* -1.27 0.00
## Stay_In_Current_City_Years* -1.07 0.00
## Marital_Status -1.86 0.00
## Product_Category_1 1.23 0.01
## Product_Category_2 -1.43 0.01
## Product_Category_3 -0.81 0.01
## Purchase -0.34 6.77
library(psych)
headTail(xtabs(~Purchase+Marital_Status,data = sale.df))
## X0 X1
## 12 57 44
## 13 63 43
## 14 53 42
## 24 78 40
## ... ... ...
## 23958 2 2
## 23959 1 1
## 23960 1 3
## 23961 2 1
library(psych)
headTail(xtabs(~Purchase+Age,data=sale.df))
## X0.17 X18.25 X26.35 X36.45 X46.50 X51.55 X55.
## 12 3 20 29 23 12 7 7
## 13 3 17 50 16 10 5 5
## 14 2 19 33 19 7 13 2
## 24 5 21 46 22 9 8 7
## ... ... ... ... ... ... ... ...
## 23958 0 2 0 0 0 1 1
## 23959 0 0 1 0 0 1 0
## 23960 0 0 0 1 1 1 1
## 23961 0 0 3 0 0 0 0
library(psych)
headTail(xtabs(~Purchase+Gender,data = sale.df))
## F M
## 12 27 74
## 13 25 81
## 14 30 65
## 24 28 90
## ... ... ...
## 23958 0 4
## 23959 1 1
## 23960 0 4
## 23961 0 3
plot(sale.df$Gender,ylim =c(0,420000),xlab=c("GENDER F=FEMALE M=MALE"),main="GENDER DISTRIBUTION",col="lightgreen")
table(sale.df$Gender)
##
## F M
## 135809 414259
plot(sale.df$Gender,sale.df$Purchase,main="BOXPLOT OF PURCHASE AMOUNT ACCORDING TO GENDER",xlab="GENDER",ylab="PURCHASE AMOUNT",col ="peachpuff")
## MARITAL STATUS OF CUSTOMER
# Recode the 0/1 indicator as a labelled factor
sale.df$Marital_Status <- factor(sale.df$Marital_Status,levels = c(0,1),labels = c("UNMARRIED","MARRIED"))
plot(sale.df$Marital_Status,ylim=c(0,330000),main="COUNT OF MARITAL STATUS",col="lightblue")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
# sale.df is already in the workspace, so no data() call is needed
ggplot(sale.df,aes(x=Marital_Status,y=Purchase))+geom_boxplot(col="tan2",size=1)+
labs(title="BOXPLOT",subtitle="PURCHASE AMOUNT VS MARITAL STATUS",y="PURCHASE AMOUNT")
library(ggplot2)
# geom_bar() counts rows, so the y axis is the number of purchases per age group
g1<- ggplot(sale.df,aes(x=Age))+geom_bar(size=2,col="steelblue2")+
labs(y="NUMBER OF PURCHASES")
plot(g1)
The 26-35 age group accounts for the most purchases, followed by the 36-45 age group.
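A bar chart of Age counts rows rather than summing Purchase, so it can help to put the number of purchases, the total spend and the mean spend per age group side by side. The following is a minimal sketch on made-up data; the `toy` data frame and its values are illustrative assumptions, not rows from train.csv.

```r
# Toy data standing in for sale.df (values are invented for illustration).
toy <- data.frame(
  Age      = c("18-25", "26-35", "26-35", "26-35", "36-45", "36-45"),
  Purchase = c(9000, 8000, 12000, 7000, 15000, 5000)
)

# Number of purchases, total spend and mean spend per age group.
n_by_age     <- table(toy$Age)
total_by_age <- tapply(toy$Purchase, toy$Age, sum)
mean_by_age  <- tapply(toy$Purchase, toy$Age, mean)

summary_by_age <- data.frame(
  n         = as.vector(n_by_age),
  total     = as.vector(total_by_age),
  mean      = as.vector(mean_by_age),
  row.names = names(total_by_age)
)
print(summary_by_age)
```

In this toy example the 26-35 group has the most purchases and the highest total, while the 36-45 group has the highest mean, which is why the three measures are worth reading separately.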
# xlab (not xlabs) is the axis-label argument; plotting a factor shows the count per level
plot(sale.df$Stay_In_Current_City_Years,col="steelblue2",main="PURCHASES BY YEARS OF STAY IN CURRENT CITY",ylab="NUMBER OF PURCHASES",xlab="STAY IN CURRENT CITY IN YEARS")
From the plot above, customers who have lived in their current city for 1-2 years shop more than the others.
library(ggplot2)
ggplot(sale.df,aes(x=Product_Category_1,y=Purchase))+geom_col(aes(col=Gender))
## PRODUCT CATEGORY 2 SALES
library(ggplot2)
ggplot(sale.df,aes(x=Product_Category_2,y=Purchase))+geom_col(col="green")
## Warning: Removed 173638 rows containing missing values (position_stack).
library(ggplot2)
ggplot(sale.df,aes(x=Product_Category_3,y=Purchase))+geom_col(col="lightblue")
## Warning: Removed 383247 rows containing missing values (position_stack).
library(ggplot2)
theme_set(theme_classic())
# Treat the category code as a factor so that one box is drawn per category
ggplot(sale.df,aes(x=factor(Product_Category_1),y=Purchase))+geom_boxplot(size=2)+
labs(x="Product_Category_1")
# Purchase distribution within each secondary category (rows with NA are dropped)
boxplot(Purchase~Product_Category_2,data=sale.df,col="blue")
boxplot(Purchase~Product_Category_3,data=sale.df,col="blue")
library(ggplot2)
g1<-ggplot(sale.df, aes(x=Age, y=Purchase)) + geom_col(aes(col=Gender),size=2,width = 0.8)
plot(g1)
library(ggplot2)
g1<-ggplot(sale.df, aes(x=Gender, y=Purchase)) + geom_col(aes(col=Marital_Status),size=2,width = 0.8)
plot(g1)
library(ggplot2)
g1<-ggplot(sale.df, aes(x=Marital_Status, y=Purchase)) + geom_col(aes(col=Gender),size=2,width = 0.8)
plot(g1)
These plots suggest that unmarried customers buy more than married ones, among both males and females, with unmarried males buying the most.
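The stacked columns can be checked numerically with a Gender by Marital_Status table of total and mean purchase. This is a small sketch on invented data; `toy` and its values are assumptions for illustration only.

```r
# Toy data standing in for sale.df (values are invented for illustration).
toy <- data.frame(
  Gender         = c("M", "M", "M", "F", "F", "M"),
  Marital_Status = c("UNMARRIED", "UNMARRIED", "MARRIED",
                     "UNMARRIED", "MARRIED", "MARRIED"),
  Purchase       = c(10000, 8000, 9000, 7000, 6000, 5000)
)

# Total purchase amount in each Gender x Marital_Status cell.
total_spend <- xtabs(Purchase ~ Gender + Marital_Status, data = toy)

# Mean purchase amount in each cell, which is not affected by group size.
mean_spend <- tapply(toy$Purchase,
                     list(toy$Gender, toy$Marital_Status),
                     mean)

print(total_spend)
print(mean_spend)
```

Totals grow with the number of rows in a group, so the mean table is the fairer comparison when the groups differ in size, as they do for gender in this data set.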
plot(sale.df$City_Category)
City category B has more sales than the others.
## Occupation Vs Purchase
library(ggplot2)
g1<-ggplot(sale.df, aes(x=Occupation, y=Purchase)) + geom_col(col="lightblue",size=2,width = 0.8)
plot(g1)
round(cor(Filter(is.numeric, sale.df)),2)
## User_ID Occupation Product_Category_1
## User_ID 1.00 -0.02 0.00
## Occupation -0.02 1.00 -0.01
## Product_Category_1 0.00 -0.01 1.00
## Product_Category_2 NA NA NA
## Product_Category_3 NA NA NA
## Purchase 0.00 0.02 -0.34
## Product_Category_2 Product_Category_3 Purchase
## User_ID NA NA 0.00
## Occupation NA NA 0.02
## Product_Category_1 NA NA -0.34
## Product_Category_2 1 NA NA
## Product_Category_3 NA 1 NA
## Purchase NA NA 1.00
library(corrgram)
corrgram(sale.df, upper.panel=panel.pie,main= "Corrgram of store variables" )
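The NA entries in the correlation matrix come from the missing values in Product_Category_2 and Product_Category_3. One possible way to get a number for those cells is the `use = "pairwise.complete.obs"` argument of `cor()`. The sketch below uses the first six rows shown in the `str()` output above; the name `toy` is an assumption.

```r
# First six rows of three columns, as printed by str(sale.df).
toy <- data.frame(
  Purchase           = c(8370, 15200, 1422, 1057, 7969, 15227),
  Product_Category_1 = c(3, 1, 12, 12, 8, 1),
  Product_Category_2 = c(NA, 6, NA, 14, NA, 2)
)

# Default behaviour: any NA in a column makes its correlations NA.
cor_default <- cor(toy)

# Pairwise deletion: each pair of columns uses the rows where both are present.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(round(cor_default, 2))
print(round(cor_pairwise, 2))
```

With pairwise deletion each cell may be based on a different number of rows, which should be kept in mind when comparing cells.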
## T test
Hypothesis: There is a difference between the purchase amounts of males and females. Null hypothesis: There is no difference between the purchase amounts of males and females.
sale.df$Gender <- ifelse(sale.df$Gender == "F", 1, 0)
t.test(sale.df$Purchase,sale.df$Gender)
##
## Welch Two Sample t-test
##
## data: sale.df$Purchase and sale.df$Gender
## t = 1367.8, df = 550070, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 9250.448 9276.996
## sample estimates:
## mean of x mean of y
## 9263.9687130 0.2468949
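The p-value is below 0.05, so the null hypothesis of this test is rejected. Note, however, that `t.test(sale.df$Purchase,sale.df$Gender)` compares the mean of Purchase (9263.97) with the mean of the 0/1 Gender indicator (0.25), so this output does not by itself show that the purchase amounts of males and females differ. Testing that hypothesis requires a grouped comparison of the form `t.test(Purchase~Gender,data=sale.df)`.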
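A grouped Welch t-test, which compares mean purchase between the two genders, could look like the following sketch. The data are simulated; the group sizes, means and standard deviation are arbitrary assumptions and are not estimates from train.csv.

```r
# Simulated stand-in for sale.df; all numbers here are arbitrary.
set.seed(1)
toy <- data.frame(
  Gender   = factor(rep(c("F", "M"), times = c(40, 60))),
  Purchase = c(rnorm(40, mean = 8700, sd = 1000),
               rnorm(60, mean = 9400, sd = 1000))
)

# Formula interface: Purchase is split by Gender and the two group means
# are compared (Welch test, unequal variances allowed by default).
res <- t.test(Purchase ~ Gender, data = toy)

print(res$estimate)  # mean of Purchase in group F and in group M
print(res$p.value)
```

The formula interface needs Gender to hold exactly two groups, so the test has to be run before any step that overwrites the column.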
#Converting Gender to Binary
sale.df$Gender <- ifelse(sale.df$Gender == "F", 1, 0)
cor.test(sale.df$Purchase,sale.df$Gender)
## Warning in cor(x, y): the standard deviation is zero
##
## Pearson's product-moment correlation
##
## data: sale.df$Purchase and sale.df$Gender
## t = NA, df = 550070, p-value = NA
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## NA NA
## sample estimates:
## cor
## NA
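The test returns NA, so no conclusion about the relation between Purchase Amount and Gender can be drawn from this output. Gender had already been converted to 0/1 before the t-test, so the second `ifelse()` call compares numbers with "F" and sets every value to 0, leaving a constant column with zero standard deviation.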
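One way to avoid the double conversion is to write the indicator into a new column and leave Gender untouched, so that re-running the line is harmless. A minimal sketch, assuming a hypothetical column name `Gender_F` and a handful of made-up rows:

```r
# Toy data standing in for sale.df (values chosen for illustration).
toy <- data.frame(
  Gender   = factor(c("F", "F", "M", "M", "M", "F", "M", "M")),
  Purchase = c(8370, 15200, 7969, 15227, 19215, 1422, 15854, 7871)
)

# Derive the 0/1 indicator from the original factor; Gender itself is kept.
toy$Gender_F <- ifelse(toy$Gender == "F", 1, 0)

# Running the same line again gives the same result, because it always
# reads from the unchanged Gender column.
toy$Gender_F <- ifelse(toy$Gender == "F", 1, 0)

# Guard against a constant column before testing the correlation.
stopifnot(sd(toy$Gender_F) > 0)
res <- cor.test(toy$Purchase, toy$Gender_F)
print(res$estimate)
```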
2. Test correlation between Purchase Amount and years of stay in the current city.
sale.df$Stay_In_Current_City_Years[sale.df$Stay_In_Current_City_Years == "4+"] <- "4"
## Warning in `[<-.factor`(`*tmp*`, sale.df$Stay_In_Current_City_Years ==
## "4+", : invalid factor level, NA generated
sale.df$Stay_In_Current_City_Years <- as.integer(sale.df$Stay_In_Current_City_Years)
cor.test(sale.df$Stay_In_Current_City_Years,sale.df$Purchase)
##
## Pearson's product-moment correlation
##
## data: sale.df$Stay_In_Current_City_Years and sale.df$Purchase
## t = 4.9632, df = 465340, p-value = 6.937e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.004402452 0.010148499
## sample estimates:
## cor
## 0.007275536
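The p-value (6.937e-07) is below 0.05, so the correlation between Purchase Amount and years of stay in the current city is statistically significant, but at r = 0.007 it is negligible in practice. Two caveats apply to this test: the recoding step turned every "4+" value into NA (see the warning above), so those rows are excluded, and `as.integer()` on a factor returns the level codes 1-4 rather than the years 0-3.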
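The "4+" level can be converted to a number without losing rows by going through character first. This is a small sketch on a made-up factor; treating "4+" as 4 is an assumption, since the true number of years is only known to be at least 4.

```r
# A factor with the same levels as Stay_In_Current_City_Years.
stay <- factor(c("2", "4+", "3", "0", "1", "4+"))

# as.integer() on a factor returns level codes, not the labels.
codes <- as.integer(stay)

# Strip the "+" from the labels, then convert the text to integers.
stay_years <- as.integer(sub("+", "", as.character(stay), fixed = TRUE))

print(codes)       # 3 5 4 1 2 5
print(stay_years)  # 2 4 3 0 1 4
```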
3. Test correlation between Purchase Amount and Occupation of customer.
cor.test(sale.df$Purchase,sale.df$Occupation)
##
## Pearson's product-moment correlation
##
## data: sale.df$Purchase and sale.df$Occupation
## t = 15.454, df = 550070, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01819097 0.02347398
## sample estimates:
## cor
## 0.02083262
As the p-value is below 0.05, the correlation between Purchase Amount and Occupation of customer is statistically significant, although very weak (r = 0.02).
4. Test correlation between Purchase Amount and Product Category 1.
cor.test(sale.df$Purchase,sale.df$Product_Category_1)
##
## Pearson's product-moment correlation
##
## data: sale.df$Purchase and sale.df$Product_Category_1
## t = -271.45, df = 550070, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3460317 -0.3413708
## sample estimates:
## cor
## -0.3437033
As the p-value is below 0.05, we can conclude that there is a relation between Purchase Amount and Product Category 1, with a moderate negative correlation (r = -0.34).
5. Test correlation between Purchase Amount and Product Category 2.
cor.test(sale.df$Purchase,sale.df$Product_Category_2)
##
## Pearson's product-moment correlation
##
## data: sale.df$Purchase and sale.df$Product_Category_2
## t = -131.73, df = 376430, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2129702 -0.2068627
## sample estimates:
## cor
## -0.2099185
As the p-value is below 0.05, we can conclude that there is a relation between Purchase Amount and Product Category 2, with a weak negative correlation (r = -0.21).
6. Test correlation between Purchase Amount and Product Category 3.
cor.test(sale.df$Purchase,sale.df$Product_Category_3)
##
## Pearson's product-moment correlation
##
## data: sale.df$Purchase and sale.df$Product_Category_3
## t = -8.9901, df = 166820, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02680159 -0.01720885
## sample estimates:
## cor
## -0.02200573
As the p-value is below 0.05, the correlation between Purchase Amount and Product Category 3 is statistically significant, although very weak (r = -0.02).
Hypothesis: There is a difference between the purchase amounts of males and females. Null hypothesis: There is no difference between the purchase amounts of males and females.
In order to test Hypothesis 1a, we proposed the following model:
Model1 <- (Purchase~Gender+Occupation+City_Category+Stay_In_Current_City_Years+Product_Category_1+Marital_Status+Product_Category_2+Product_Category_3)
fit1 <- lm(Model1, data = sale.df)
summary(fit1)
##
## Call:
## lm(formula = Model1, data = sale.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10748.2 -2809.1 -433.2 2760.6 19914.6
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12258.330 57.139 214.536 < 2e-16 ***
## Gender NA NA NA NA
## Occupation 12.121 1.906 6.359 2.03e-10 ***
## City_CategoryB 219.141 31.224 7.018 2.26e-12 ***
## City_CategoryC 837.906 32.725 25.605 < 2e-16 ***
## Stay_In_Current_City_Years 19.963 12.467 1.601 0.1093
## Product_Category_1 -836.038 5.551 -150.604 < 2e-16 ***
## Marital_StatusMARRIED 44.223 25.167 1.757 0.0789 .
## Product_Category_2 26.114 3.677 7.103 1.23e-12 ***
## Product_Category_3 76.263 3.564 21.397 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4637 on 141450 degrees of freedom
## (408609 observations deleted due to missingness)
## Multiple R-squared: 0.1689, Adjusted R-squared: 0.1688
## F-statistic: 3593 on 8 and 141450 DF, p-value: < 2.2e-16
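The summary reports 408609 observations deleted due to missingness, because `lm()` keeps only rows where every predictor is present. The sketch below counts such rows and shows one possible alternative, recoding the missing secondary categories as 0. Using 0 as a "no secondary category" code is an assumption of this sketch (0 does not occur in either column according to the `describe()` output); the rows are the first six shown by `str()`.

```r
# First six rows of three columns, as printed by str(sale.df).
toy <- data.frame(
  Purchase           = c(8370, 15200, 1422, 1057, 7969, 15227),
  Product_Category_2 = c(NA, 6, NA, 14, NA, 2),
  Product_Category_3 = c(NA, 14, NA, NA, NA, NA)
)

# Rows that lm() would keep when both categories are predictors.
n_complete <- sum(complete.cases(toy))

# Assumed alternative: treat a missing secondary category as its own code, 0.
filled <- toy
filled$Product_Category_2[is.na(filled$Product_Category_2)] <- 0
filled$Product_Category_3[is.na(filled$Product_Category_3)] <- 0
n_filled <- sum(complete.cases(filled))

print(c(before = n_complete, after = n_filled))
```

If the columns are recoded this way, wrapping them in `factor()` inside the model formula keeps 0 from being read as a point on a numeric scale.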
library(leaps)
leap <- regsubsets(Model1, data = sale.df, nbest=1)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 1 linear dependencies found
## Reordering variables and trying again:
summary(leap)
## Subset selection object
## Call: regsubsets.formula(Model1, data = sale.df, nbest = 1)
## 9 Variables (and intercept)
## Forced in Forced out
## Occupation FALSE FALSE
## City_CategoryB FALSE FALSE
## City_CategoryC FALSE FALSE
## Stay_In_Current_City_Years FALSE FALSE
## Product_Category_1 FALSE FALSE
## Marital_StatusMARRIED FALSE FALSE
## Product_Category_2 FALSE FALSE
## Product_Category_3 FALSE FALSE
## Gender FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## Gender Occupation City_CategoryB City_CategoryC
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " "*"
## 4 ( 1 ) " " " " "*" "*"
## 5 ( 1 ) " " " " "*" "*"
## 6 ( 1 ) " " "*" "*" "*"
## 7 ( 1 ) " " "*" "*" "*"
## 8 ( 1 ) " " "*" "*" "*"
## Stay_In_Current_City_Years Product_Category_1
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) " " "*"
## 4 ( 1 ) " " "*"
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) "*" "*"
## Marital_StatusMARRIED Product_Category_2 Product_Category_3
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " "*"
## 3 ( 1 ) " " " " "*"
## 4 ( 1 ) " " " " "*"
## 5 ( 1 ) " " "*" "*"
## 6 ( 1 ) " " "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
plot(leap, scale="adjr2")
## MODEL2
Hence, by eliminating a few variables from Model 1, we specify Model 2 as
y = B0 + B1*x1 + B2*x2 + B3*x3 + B4*x4 + B5*x5
where
y = Purchase amount of customer, x1 = Stay_In_Current_City_Years, x2 = Product_Category_1, x3 = Marital_Status, x4 = Product_Category_2, x5 = Product_Category_3.
Model2 <- (Purchase~Stay_In_Current_City_Years+Product_Category_1+Marital_Status+Product_Category_2+Product_Category_3)
fit12 <- lm(Model2, data = sale.df)
summary(fit12)
##
## Call:
## lm(formula = Model2, data = sale.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10294 -2875 -518 2707 19551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12706.183 51.918 244.734 < 2e-16 ***
## Stay_In_Current_City_Years 23.859 12.492 1.910 0.05614 .
## Product_Category_1 -840.728 5.565 -151.084 < 2e-16 ***
## Marital_StatusMARRIED 79.658 25.198 3.161 0.00157 **
## Product_Category_2 26.878 3.687 7.289 3.13e-13 ***
## Product_Category_3 76.660 3.574 21.448 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4650 on 141453 degrees of freedom
## (408609 observations deleted due to missingness)
## Multiple R-squared: 0.164, Adjusted R-squared: 0.164
## F-statistic: 5552 on 5 and 141453 DF, p-value: < 2.2e-16
From Model 2, we established the effect of product category, marital status and stay in current city on the purchase amount of a customer with a simpler model. We regressed the purchase amount on the three product categories together with these two customer attributes, and estimated the model using multiple linear regression (`lm`).
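When one model is obtained by dropping variables from another, the two can be compared through adjusted R-squared and an F-test with `anova()`. The following is a minimal sketch on simulated data; the data-generating numbers and the names `small` and `large` are assumptions, and the two formulas are shortened stand-ins for Model 2 and Model 1.

```r
# Simulated stand-in for sale.df; all numbers here are arbitrary.
set.seed(42)
n <- 200
toy <- data.frame(
  Product_Category_1 = sample(1:20, n, replace = TRUE),
  Occupation         = sample(0:20, n, replace = TRUE)
)
toy$Purchase <- 12000 - 800 * toy$Product_Category_1 + rnorm(n, sd = 3000)

# The smaller model is nested in the larger one.
small <- lm(Purchase ~ Product_Category_1, data = toy)
large <- lm(Purchase ~ Product_Category_1 + Occupation, data = toy)

# Adjusted R-squared penalises extra predictors.
adj_r2 <- c(small = summary(small)$adj.r.squared,
            large = summary(large)$adj.r.squared)
print(adj_r2)

# F-test of whether the extra predictor improves the fit.
cmp <- anova(small, large)
print(cmp)
```

Both models must be fitted on the same rows for `anova()` to work, which matters here because missing values change which rows `lm()` keeps.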
library(leaps)
leap1 <- regsubsets(Model2, data = sale.df, nbest=1)
summary(leap1)
## Subset selection object
## Call: regsubsets.formula(Model2, data = sale.df, nbest = 1)
## 5 Variables (and intercept)
## Forced in Forced out
## Stay_In_Current_City_Years FALSE FALSE
## Product_Category_1 FALSE FALSE
## Marital_StatusMARRIED FALSE FALSE
## Product_Category_2 FALSE FALSE
## Product_Category_3 FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## Stay_In_Current_City_Years Product_Category_1
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) " " "*"
## 4 ( 1 ) " " "*"
## 5 ( 1 ) "*" "*"
## Marital_StatusMARRIED Product_Category_2 Product_Category_3
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " "*"
## 3 ( 1 ) " " "*" "*"
## 4 ( 1 ) "*" "*" "*"
## 5 ( 1 ) "*" "*" "*"
plot(leap1, scale="adjr2")
coefficients(fit12)
## (Intercept) Stay_In_Current_City_Years
## 12706.18347 23.85871
## Product_Category_1 Marital_StatusMARRIED
## -840.72818 79.65808
## Product_Category_2 Product_Category_3
## 26.87784 76.65997
The values above are the beta coefficients of the respective variables.
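As a worked example, the coefficients can be combined by hand into a predicted purchase amount. The coefficient values are copied from the output above; the customer described by `x` is hypothetical.

```r
# Coefficients of Model 2, copied from coefficients(fit12).
beta <- c(
  "(Intercept)"              = 12706.18347,
  Stay_In_Current_City_Years = 23.85871,
  Product_Category_1         = -840.72818,
  Marital_StatusMARRIED      = 79.65808,
  Product_Category_2         = 26.87784,
  Product_Category_3         = 76.65997
)

# Hypothetical customer, in the same order as beta:
# intercept term 1, stay = 2, category 1 = 1, married = 1,
# category 2 = 6, category 3 = 14.
x <- c(1, 2, 1, 1, 6, 14)

# Prediction = B0 + B1*x1 + ... + B5*x5
predicted <- sum(beta * x)
print(round(predicted, 2))  # 13227.34
```

The inputs have to use the same coding as the data the model was fitted on, in particular for Stay_In_Current_City_Years.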
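We selected Model 2 as the simpler model. It is significant overall (F-test p-value < 0.05), with an adjusted R-squared of 0.164 compared with 0.1688 for Model 1. The purchase amount of a customer depends mostly on product category and, to a lesser extent, on the marital status of the customer.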
1. Purchase amount for males is high except in a few categories, so it is possible that those categories contain items used more by females than by males; otherwise, males should be our main focus.
2. People aged 26-35 spend more than all other age groups, followed by the 36-45 age group. Almost the same trend holds for all product categories, so we should focus on these two age groups along with the 18-25 age group.
3. People who are not married spend more than people who are married, especially males. There are more changes in female purchases after marriage than in male purchases.
This paper was motivated by the need for research that could improve our understanding of customer behaviour regarding various products and factors. The contribution of this paper is that we investigated the purchase amounts of customers at a store and studied their behaviour. We found that the purchase amount for males is high except in a few categories. We observed that people who are not married spend more than people who are married, especially males. There are more changes in female purchases after marriage than in male purchases.