INTRODUCTION

Black Friday is an informal name for the day following Thanksgiving Day in the United States, the fourth Thursday of November, which has been regarded as the beginning of the country’s Christmas shopping season since 1952.

Most major retailers open very early, some during the overnight hours, and offer promotional sales. Black Friday is not an official holiday, but California and some other states observe “The Day After Thanksgiving” as a holiday for state government employees, sometimes in lieu of another federal holiday such as Columbus Day.[2] Many non-retail employees and schools have both Thanksgiving and the following Friday off, which, along with the following regular weekend, makes it a four-day weekend and increases the number of potential shoppers.

OVERVIEW OF STUDY

A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) against various factors. It has shared a purchase summary of various customers for selected high-volume products from last month.

The data set contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and Total purchase_amount from last month.

Now the company wants to build a model to predict the purchase amount of a customer.

DATA

Reading Dataset

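setwd("G:/Data Science Projects/Black Friday Regression sales")
# stringsAsFactors = TRUE keeps the text columns as factors (no longer the default from R 4.0.0)
sale.df <- read.csv("train.csv",sep = ",",stringsAsFactors = TRUE)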
View(sale.df)
dim(sale.df)
## [1] 550068     12

The dataset consists of 550068 rows and 12 columns.

Summary

str(sale.df)
## 'data.frame':    550068 obs. of  12 variables:
##  $ User_ID                   : int  1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
##  $ Product_ID                : Factor w/ 3631 levels "P00000142","P00000242",..: 673 2377 853 829 2735 1832 1746 3321 3605 2632 ...
##  $ Gender                    : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
##  $ Age                       : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
##  $ Occupation                : int  10 10 10 10 16 15 7 7 7 20 ...
##  $ City_Category             : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
##  $ Stay_In_Current_City_Years: Factor w/ 5 levels "0","1","2","3",..: 3 3 3 3 5 4 3 3 3 2 ...
##  $ Marital_Status            : int  0 0 0 0 0 0 1 1 1 1 ...
##  $ Product_Category_1        : int  3 1 12 12 8 1 1 1 1 8 ...
##  $ Product_Category_2        : int  NA 6 NA 14 NA 2 8 15 16 NA ...
##  $ Product_Category_3        : int  NA 14 NA NA NA NA 17 NA NA NA ...
##  $ Purchase                  : int  8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...
library(psych)
describe(sale.df)
##                             vars      n       mean      sd  median
## User_ID                        1 550068 1003028.84 1727.59 1003077
## Product_ID*                    2 550068    1708.47 1012.20    1667
## Gender*                        3 550068       1.75    0.43       2
## Age*                           4 550068       3.50    1.35       3
## Occupation                     5 550068       8.08    6.52       7
## City_Category*                 6 550068       2.04    0.76       2
## Stay_In_Current_City_Years*    7 550068       2.86    1.29       3
## Marital_Status                 8 550068       0.41    0.49       0
## Product_Category_1             9 550068       5.40    3.94       5
## Product_Category_2            10 376430       9.84    5.09       9
## Product_Category_3            11 166821      12.67    4.13      14
## Purchase                      12 550068    9263.97 5023.07    8047
##                                trimmed     mad     min     max range  skew
## User_ID                     1003026.91 2176.46 1000001 1006040  6039  0.00
## Product_ID*                    1688.69 1197.94       1    3631  3630  0.15
## Gender*                           1.82    0.00       1       2     1 -1.17
## Age*                              3.36    1.48       1       7     6  0.81
## Occupation                        7.69    8.90       0      20    20  0.40
## City_Category*                    2.05    1.48       1       3     2 -0.07
## Stay_In_Current_City_Years*       2.82    1.48       1       5     4  0.32
## Marital_Status                    0.39    0.00       0       1     1  0.37
## Product_Category_1                4.90    4.45       1      20    19  1.03
## Product_Category_2                9.99    7.41       2      18    16 -0.16
## Product_Category_3               13.07    2.97       3      18    15 -0.77
## Purchase                       8929.12 4256.54      12   23961 23949  0.60
##                             kurtosis   se
## User_ID                        -1.20 2.33
## Product_ID*                    -1.09 1.36
## Gender*                        -0.62 0.00
## Age*                            0.30 0.00
## Occupation                     -1.22 0.01
## City_Category*                 -1.27 0.00
## Stay_In_Current_City_Years*    -1.07 0.00
## Marital_Status                 -1.86 0.00
## Product_Category_1              1.23 0.01
## Product_Category_2             -1.43 0.01
## Product_Category_3             -0.81 0.01
## Purchase                       -0.34 6.77
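
The n column of the describe() output shows that Product_Category_2 and Product_Category_3 have far fewer non-missing values than the other columns (376430 and 166821 out of 550068). A minimal sketch of how the missing values per column could be counted directly; the toy data frame below is made up for illustration and stands in for sale.df:

```r
# Toy stand-in for sale.df (values are made up for illustration)
toy <- data.frame(
  Product_Category_1 = c(3, 1, 12, 8),
  Product_Category_2 = c(NA, 6, NA, 2),
  Product_Category_3 = c(NA, 14, NA, NA)
)

# Number and share of missing values in each column
na_count <- colSums(is.na(toy))
na_share <- round(colMeans(is.na(toy)), 2)
print(rbind(na_count, na_share))
```

Running the same two calls on sale.df would show which columns need attention before correlation or regression steps.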

Contingency tables showing how purchase amount is distributed across marital status, age and gender.

library(psych)
headTail(xtabs(~Purchase+Marital_Status,data = sale.df))
##        X0  X1
## 12     57  44
## 13     63  43
## 14     53  42
## 24     78  40
## ...   ... ...
## 23958   2   2
## 23959   1   1
## 23960   1   3
## 23961   2   1
library(psych)
headTail(xtabs(~Purchase+Age,data=sale.df))
##       X0.17 X18.25 X26.35 X36.45 X46.50 X51.55 X55.
## 12        3     20     29     23     12      7    7
## 13        3     17     50     16     10      5    5
## 14        2     19     33     19      7     13    2
## 24        5     21     46     22      9      8    7
## ...     ...    ...    ...    ...    ...    ...  ...
## 23958     0      2      0      0      0      1    1
## 23959     0      0      1      0      0      1    0
## 23960     0      0      0      1      1      1    1
## 23961     0      0      3      0      0      0    0
library(psych)
headTail(xtabs(~Purchase+Gender,data = sale.df))
##         F   M
## 12     27  74
## 13     25  81
## 14     30  65
## 24     28  90
## ...   ... ...
## 23958   0   4
## 23959   1   1
## 23960   0   4
## 23961   0   3
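
Because Purchase takes thousands of distinct values, cross-tabulating it against a factor gives one row per amount. A per-group summary is often easier to read. The sketch below uses a small made-up data frame (not the real data) to show the idea with tapply():

```r
# Toy stand-in for sale.df (values are made up for illustration)
toy <- data.frame(
  Gender   = factor(c("F", "F", "M", "M", "M")),
  Purchase = c(8000, 6000, 9000, 12000, 9000)
)

# Mean and median purchase amount within each gender
mean_by_gender   <- tapply(toy$Purchase, toy$Gender, mean)
median_by_gender <- tapply(toy$Purchase, toy$Gender, median)
print(rbind(mean_by_gender, median_by_gender))
```

The same pattern applies to Age or Marital_Status by swapping the grouping column.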

Visualizations

GENDER DISTRIBUTION

plot(sale.df$Gender,ylim =c(0,420000),xlab=c("GENDER F=FEMALE M=MALE"),main="GENDER DISTRIBUTION",col="lightgreen")

table(sale.df$Gender)
## 
##      F      M 
## 135809 414259

PURCHASE AMOUNT ACCORDING TO GENDER

plot(sale.df$Gender,sale.df$Purchase,main="BOXPLOT OF PURCHASE AMOUNT ACCORDING TO GENDER",xlab="GENDER",ylab="PURCHASE AMOUNT",col ="peachpuff")

MARITAL STATUS OF CUSTOMER

sale.df$Marital_Status <- factor(sale.df$Marital_Status,levels = c(0,1),labels = c("UNMARRIED","MARRIED"))
plot(sale.df$Marital_Status,ylim=c(0,330000),main="COUNT OF MARITAL STATUS",col="lightblue")

BOXPLOT OF PURCHASE AMOUNT VS MARITAL STATUS

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(sale.df,aes(x=Marital_Status,y=Purchase))+geom_boxplot(col="tan2",size=1)+
  labs(title="BOXPLOT",subtitle="PURCHASE AMOUNT VS MARITAL STATUS",y="PURCHASE AMOUNT")

NUMBER OF PURCHASES BY AGE GROUP

library(ggplot2)
g1<- ggplot(sale.df,aes(x=Age))+geom_bar(size=2,col="steelblue2")+
  labs(y="NUMBER OF PURCHASES")
plot(g1)

People aged 26-35 make more purchases than any other age group, followed by the 36-45 group.
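
geom_bar() counts rows, so the bar chart above shows the number of purchase records per age group. Since one customer can appear in many rows, record counts, distinct customers and total spend can rank the groups differently. A minimal sketch on made-up data (the column names follow sale.df, the values do not):

```r
# Toy stand-in for sale.df (values are made up for illustration)
toy <- data.frame(
  User_ID  = c(1, 1, 1, 2, 3, 3),
  Age      = factor(c("26-35", "26-35", "26-35", "36-45", "36-45", "36-45")),
  Purchase = c(100, 200, 300, 1000, 500, 500)
)

# Three different per-group measures
records   <- tapply(toy$Purchase, toy$Age, length)                          # rows
customers <- tapply(toy$User_ID, toy$Age, function(id) length(unique(id)))  # distinct buyers
spend     <- tapply(toy$Purchase, toy$Age, sum)                             # total amount
print(rbind(records, customers, spend))
```

In this toy example both groups have the same number of records, yet they differ in customers and in total spend, which is why it helps to state which measure a plot shows.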

PURCHASE VS CUSTOMER STAY IN CITY

plot(sale.df$Stay_In_Current_City_Years,col="steelblue2",main="NUMBER OF PURCHASES BY STAY IN CURRENT CITY",ylab="NUMBER OF PURCHASES",xlab="STAY IN CURRENT CITY IN YEARS")

The plot shows that customers who have lived in their current city for 1-2 years make more purchases than the others.

PRODUCT CATEGORY 1 SALES

library(ggplot2)
ggplot(sale.df,aes(x=Product_Category_1,y=Purchase))+geom_col(aes(col=Gender))

PRODUCT CATEGORY 2 SALES

library(ggplot2)
ggplot(sale.df,aes(x=Product_Category_2,y=Purchase))+geom_col(col="green")
## Warning: Removed 173638 rows containing missing values (position_stack).

library(ggplot2)
ggplot(sale.df,aes(x=Product_Category_3,y=Purchase))+geom_col(col="lightblue")
## Warning: Removed 383247 rows containing missing values (position_stack).

PURCHASE AMOUNT OF PRODUCT CATEGORIES

library(ggplot2)
theme_set(theme_classic())
# Treat the category code as discrete so that each category gets its own box
ggplot(sale.df,aes(x=factor(Product_Category_1),y=Purchase))+geom_boxplot(size=2)

# Formula interface: one box of Purchase per category level
boxplot(Purchase~Product_Category_2,data=sale.df,col="blue")

boxplot(Purchase~Product_Category_3,data=sale.df,col="blue")

AGE VS PURCHASE

library(ggplot2)
g1<-ggplot(sale.df, aes(x=Age, y=Purchase)) + geom_col(aes(col=Gender),size=2,width = 0.8)

plot(g1)

library(ggplot2)
g1<-ggplot(sale.df, aes(x=Gender, y=Purchase)) + geom_col(aes(col=Marital_Status),size=2,width = 0.8)
plot(g1)

library(ggplot2)
g1<-ggplot(sale.df, aes(x=Marital_Status, y=Purchase)) + geom_col(aes(col=Gender),size=2,width = 0.8)

plot(g1)

These plots suggest that unmarried males buy the most, and that unmarried customers also outspend married ones among females.

plot(sale.df$City_Category)

City category B has more purchases than the others.

OCCUPATION VS PURCHASE

library(ggplot2)
g1<-ggplot(sale.df, aes(x=Occupation, y=Purchase)) + geom_col(col="lightblue",size=2,width = 0.8)

plot(g1)

CORRELATION MATRIX

round(cor(Filter(is.numeric, sale.df)),2)
##                    User_ID Occupation Product_Category_1
## User_ID               1.00      -0.02               0.00
## Occupation           -0.02       1.00              -0.01
## Product_Category_1    0.00      -0.01               1.00
## Product_Category_2      NA         NA                 NA
## Product_Category_3      NA         NA                 NA
## Purchase              0.00       0.02              -0.34
##                    Product_Category_2 Product_Category_3 Purchase
## User_ID                            NA                 NA     0.00
## Occupation                         NA                 NA     0.02
## Product_Category_1                 NA                 NA    -0.34
## Product_Category_2                  1                 NA       NA
## Product_Category_3                 NA                  1       NA
## Purchase                           NA                 NA     1.00
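
The NA entries appear because cor() by default returns NA for any pair involving a column with missing values, and Product_Category_2 and Product_Category_3 both contain NAs. One common option is the use = "pairwise.complete.obs" argument of cor(), which computes each coefficient from the rows where both variables are present. A minimal sketch on a made-up data frame:

```r
# Toy stand-in for the numeric columns of sale.df (values are illustrative)
toy <- data.frame(
  Purchase           = c(8370, 15200, 1422, 1057, 7969, 15227),
  Product_Category_1 = c(3, 1, 12, 12, 8, 1),
  Product_Category_2 = c(NA, 6, NA, 14, NA, 2)
)

# Default behaviour: any pair involving a column with NA yields NA
cor_default <- cor(toy)

# Pairwise deletion: each coefficient uses the rows complete for that pair
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_pairwise, 2))
```

With pairwise deletion each cell may be based on a different number of rows, so the coefficients are not strictly comparable across cells.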

CORRGRAM

library(corrgram)
corrgram(sale.df, upper.panel=panel.pie,main= "Corrgram of store variables" )

T TEST

Hypothesis: There is a difference between the purchase amounts of males and females. Null hypothesis: There is no difference between the purchase amounts of males and females.

sale.df$Gender <- ifelse(sale.df$Gender == "F", 1, 0)
t.test(sale.df$Purchase,sale.df$Gender)
## 
##  Welch Two Sample t-test
## 
## data:  sale.df$Purchase and sale.df$Gender
## t = 1367.8, df = 550070, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  9250.448 9276.996
## sample estimates:
##    mean of x    mean of y 
## 9263.9687130    0.2468949

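The p-value is below 0.05, so the null hypothesis of this test is rejected. Note, however, that the call above compares the mean of Purchase (9263.97) with the mean of the 0/1 Gender indicator (0.25) rather than the purchase amounts of males and females; the grouped form t.test(Purchase ~ Gender, data = sale.df) tests the stated hypothesis.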
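
To compare purchase amounts between two groups, t.test() accepts a formula of the form outcome ~ group. The sketch below illustrates this grouped Welch test on simulated data; the group means and standard deviations are arbitrary assumptions, not estimates from sale.df:

```r
set.seed(1)

# Simulated purchases for two groups (parameters are made up for illustration)
toy <- data.frame(
  Gender   = factor(rep(c("F", "M"), each = 50)),
  Purchase = c(rnorm(50, mean = 8700, sd = 4800),
               rnorm(50, mean = 9400, sd = 5000))
)

# Welch two-sample t-test of Purchase between the two Gender groups
res <- t.test(Purchase ~ Gender, data = toy)
print(res$estimate)   # one mean per group
print(res$p.value)
```

Here the sample estimates are the two group means, so the confidence interval describes the difference in mean purchase between females and males.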

CORRELATION TESTS

1. Test correlation between Purchase Amount and Gender of customer.
# Converting Gender to binary

sale.df$Gender <- ifelse(sale.df$Gender == "F", 1, 0)

cor.test(sale.df$Purchase,sale.df$Gender)
## Warning in cor(x, y): the standard deviation is zero
## 
##  Pearson's product-moment correlation
## 
## data:  sale.df$Purchase and sale.df$Gender
## t = NA, df = 550070, p-value = NA
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  NA NA
## sample estimates:
## cor 
##  NA

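The test returns NA rather than a p-value: Gender had already been recoded to 0/1 for the t test, so recoding it again sets every value to 0 and leaves it with zero standard deviation. No conclusion about the relation between Purchase Amount and Gender can be drawn from this output.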
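
The NA result above follows from Gender being recoded to 0/1 for the t test and then recoded a second time here. A small sketch of the effect, using made-up vectors and a hypothetical helper variable is_female that keeps the original factor untouched:

```r
# Made-up example vectors
gender   <- factor(c("F", "M", "M", "F", "M"))
purchase <- c(8370, 15200, 1422, 1057, 7969)

# Recode once, into a new variable, leaving the original factor intact
is_female <- as.integer(gender == "F")

# Applying the same recode to the already-numeric result makes every value 0
recoded_twice <- ifelse(is_female == "F", 1, 0)

print(is_female)
print(sd(recoded_twice))          # zero variance, hence NA from cor.test()
print(cor(purchase, is_female))   # defined, because is_female still varies
```

Storing the indicator under a new name means the recode can be rerun safely without changing the source column.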

2. Test correlation between Purchase Amount and stay in current city of customer.

sale.df$Stay_In_Current_City_Years[sale.df$Stay_In_Current_City_Years == "4+"] <- "4"
## Warning in `[<-.factor`(`*tmp*`, sale.df$Stay_In_Current_City_Years ==
## "4+", : invalid factor level, NA generated
sale.df$Stay_In_Current_City_Years <- as.integer(sale.df$Stay_In_Current_City_Years)
cor.test(sale.df$Stay_In_Current_City_Years,sale.df$Purchase)
## 
##  Pearson's product-moment correlation
## 
## data:  sale.df$Stay_In_Current_City_Years and sale.df$Purchase
## t = 4.9632, df = 465340, p-value = 6.937e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.004402452 0.010148499
## sample estimates:
##         cor 
## 0.007275536

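Although the p-value is below 0.05 (6.937e-07), the correlation is only 0.007, so the relation between Purchase Amount and years of stay in the current city is statistically detectable at this sample size but practically negligible. The result should also be read with care: the "4+" rows were set to NA by the assignment above (hence df = 465340), and as.integer() on the factor returns the level codes 1-4 rather than 0-3 years.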
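
The warning "invalid factor level, NA generated" arises because "4" is not one of the factor's levels, and as.integer() on a factor returns internal level codes rather than the labels. One way to obtain numeric years, sketched here on a made-up factor, is to convert to character first and strip the "+":

```r
# Made-up factor with the same kind of labels as Stay_In_Current_City_Years
stay <- factor(c("2", "4+", "0", "1", "4+", "3"))

# as.integer() on a factor returns the level codes (1, 2, ...), not the labels
codes <- as.integer(stay)

# Convert via character, removing the "+" so that "4+" becomes 4
years <- as.integer(sub("+", "", as.character(stay), fixed = TRUE))

print(codes)
print(years)
```

Treating "4+" as 4 is itself an assumption, since the true value is four years or more.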

3. Test correlation between Purchase Amount and Occupation of customer.

cor.test(sale.df$Purchase,sale.df$Occupation)
## 
##  Pearson's product-moment correlation
## 
## data:  sale.df$Purchase and sale.df$Occupation
## t = 15.454, df = 550070, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.01819097 0.02347398
## sample estimates:
##        cor 
## 0.02083262

As the p-value is below 0.05, the correlation between Purchase Amount and Occupation of customer is statistically significant, although very weak (r = 0.02).

4. Test correlation between Purchase Amount and product category 1.

cor.test(sale.df$Purchase,sale.df$Product_Category_1)
## 
##  Pearson's product-moment correlation
## 
## data:  sale.df$Purchase and sale.df$Product_Category_1
## t = -271.45, df = 550070, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3460317 -0.3413708
## sample estimates:
##        cor 
## -0.3437033

As the p-value is below 0.05, we conclude that there is a moderate negative correlation (r = -0.34) between Purchase Amount and product category 1.

5. Test correlation between Purchase Amount and product category 2.

cor.test(sale.df$Purchase,sale.df$Product_Category_2)
## 
##  Pearson's product-moment correlation
## 
## data:  sale.df$Purchase and sale.df$Product_Category_2
## t = -131.73, df = 376430, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2129702 -0.2068627
## sample estimates:
##        cor 
## -0.2099185

As the p-value is below 0.05, we conclude that there is a weak negative correlation (r = -0.21) between Purchase Amount and product category 2.

6. Test correlation between Purchase Amount and product category 3.

cor.test(sale.df$Purchase,sale.df$Product_Category_3)
## 
##  Pearson's product-moment correlation
## 
## data:  sale.df$Purchase and sale.df$Product_Category_3
## t = -8.9901, df = 166820, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02680159 -0.01720885
## sample estimates:
##         cor 
## -0.02200573

As the p-value is below 0.05, the correlation between Purchase Amount and product category 3 is statistically significant, although very weak (r = -0.02).

MODEL

Hypothesis: There is a difference between the purchase amounts of males and females. Null hypothesis: There is no difference between the purchase amounts of males and females.

To test this hypothesis, we propose the following model:

Model1 <- (Purchase~Gender+Occupation+City_Category+Stay_In_Current_City_Years+Product_Category_1+Marital_Status+Product_Category_2+Product_Category_3)
fit1 <- lm(Model1, data = sale.df)
summary(fit1)
## 
## Call:
## lm(formula = Model1, data = sale.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10748.2  -2809.1   -433.2   2760.6  19914.6 
## 
## Coefficients: (1 not defined because of singularities)
##                             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                12258.330     57.139  214.536  < 2e-16 ***
## Gender                            NA         NA       NA       NA    
## Occupation                    12.121      1.906    6.359 2.03e-10 ***
## City_CategoryB               219.141     31.224    7.018 2.26e-12 ***
## City_CategoryC               837.906     32.725   25.605  < 2e-16 ***
## Stay_In_Current_City_Years    19.963     12.467    1.601   0.1093    
## Product_Category_1          -836.038      5.551 -150.604  < 2e-16 ***
## Marital_StatusMARRIED         44.223     25.167    1.757   0.0789 .  
## Product_Category_2            26.114      3.677    7.103 1.23e-12 ***
## Product_Category_3            76.263      3.564   21.397  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4637 on 141450 degrees of freedom
##   (408609 observations deleted due to missingness)
## Multiple R-squared:  0.1689, Adjusted R-squared:  0.1688 
## F-statistic:  3593 on 8 and 141450 DF,  p-value: < 2.2e-16
library(leaps)
leap <- regsubsets(Model1, data = sale.df, nbest=1)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 1 linear dependencies found
## Reordering variables and trying again:
summary(leap)
## Subset selection object
## Call: regsubsets.formula(Model1, data = sale.df, nbest = 1)
## 9 Variables  (and intercept)
##                            Forced in Forced out
## Occupation                     FALSE      FALSE
## City_CategoryB                 FALSE      FALSE
## City_CategoryC                 FALSE      FALSE
## Stay_In_Current_City_Years     FALSE      FALSE
## Product_Category_1             FALSE      FALSE
## Marital_StatusMARRIED          FALSE      FALSE
## Product_Category_2             FALSE      FALSE
## Product_Category_3             FALSE      FALSE
## Gender                         FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          Gender Occupation City_CategoryB City_CategoryC
## 1  ( 1 ) " "    " "        " "            " "           
## 2  ( 1 ) " "    " "        " "            " "           
## 3  ( 1 ) " "    " "        " "            "*"           
## 4  ( 1 ) " "    " "        "*"            "*"           
## 5  ( 1 ) " "    " "        "*"            "*"           
## 6  ( 1 ) " "    "*"        "*"            "*"           
## 7  ( 1 ) " "    "*"        "*"            "*"           
## 8  ( 1 ) " "    "*"        "*"            "*"           
##          Stay_In_Current_City_Years Product_Category_1
## 1  ( 1 ) " "                        "*"               
## 2  ( 1 ) " "                        "*"               
## 3  ( 1 ) " "                        "*"               
## 4  ( 1 ) " "                        "*"               
## 5  ( 1 ) " "                        "*"               
## 6  ( 1 ) " "                        "*"               
## 7  ( 1 ) " "                        "*"               
## 8  ( 1 ) "*"                        "*"               
##          Marital_StatusMARRIED Product_Category_2 Product_Category_3
## 1  ( 1 ) " "                   " "                " "               
## 2  ( 1 ) " "                   " "                "*"               
## 3  ( 1 ) " "                   " "                "*"               
## 4  ( 1 ) " "                   " "                "*"               
## 5  ( 1 ) " "                   "*"                "*"               
## 6  ( 1 ) " "                   "*"                "*"               
## 7  ( 1 ) "*"                   "*"                "*"               
## 8  ( 1 ) "*"                   "*"                "*"
plot(leap, scale="adjr2")
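
The summary above reports 408609 observations deleted due to missingness: lm() drops every row that has an NA in any variable of the formula. The sketch below, on simulated data with made-up coefficients, shows how nobs() reveals the number of rows actually used and how a sparse predictor shrinks the sample:

```r
set.seed(42)
n <- 200

# Simulated data; Product_Category_3 is missing in part of the rows
toy <- data.frame(
  Product_Category_1 = sample(1:5, n, replace = TRUE),
  Product_Category_3 = sample(c(NA, 3:6), n, replace = TRUE)
)
toy$Purchase <- 12000 - 800 * toy$Product_Category_1 + rnorm(n, sd = 500)

# Including the sparse column silently discards its incomplete rows
fit_all  <- lm(Purchase ~ Product_Category_1 + Product_Category_3, data = toy)
fit_core <- lm(Purchase ~ Product_Category_1, data = toy)

rows_used <- c(all = nobs(fit_all), core = nobs(fit_core), total = nrow(toy))
print(rows_used)
```

Checking nobs() against nrow() makes it explicit which part of the data a model describes.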

MODEL 2

Hence, after eliminating a few variables from Model 1, we specify Model 2 as

y = B0 + B1*x1 + B2*x2 + B3*x3 + B4*x4 + B5*x5

where

y = purchase amount of customer, x1 = Stay_In_Current_City_Years, x2 = Product_Category_1, x3 = Marital_Status, x4 = Product_Category_2, x5 = Product_Category_3.

Model2 <- (Purchase~Stay_In_Current_City_Years+Product_Category_1+Marital_Status+Product_Category_2+Product_Category_3)
fit12 <- lm(Model2, data = sale.df)
summary(fit12)
## 
## Call:
## lm(formula = Model2, data = sale.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10294  -2875   -518   2707  19551 
## 
## Coefficients:
##                             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                12706.183     51.918  244.734  < 2e-16 ***
## Stay_In_Current_City_Years    23.859     12.492    1.910  0.05614 .  
## Product_Category_1          -840.728      5.565 -151.084  < 2e-16 ***
## Marital_StatusMARRIED         79.658     25.198    3.161  0.00157 ** 
## Product_Category_2            26.878      3.687    7.289 3.13e-13 ***
## Product_Category_3            76.660      3.574   21.448  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4650 on 141453 degrees of freedom
##   (408609 observations deleted due to missingness)
## Multiple R-squared:  0.164,  Adjusted R-squared:  0.164 
## F-statistic:  5552 on 5 and 141453 DF,  p-value: < 2.2e-16

With Model 2 we estimated the effect of product category, marital status and stay in current city on the purchase amount of a customer using a simpler model. We regressed purchase amount on these variables with a multiple linear regression model.
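
When a smaller model is nested in a larger one and both are fitted on the same rows, the two can be compared by adjusted R-squared and by an F test with anova(). A minimal sketch on simulated data; the variables x1, x2 and city and their effects are invented for illustration:

```r
set.seed(7)
n <- 300

# Simulated predictors and outcome (all values are made up)
toy <- data.frame(
  x1   = rnorm(n),
  x2   = rnorm(n),
  city = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
toy$y <- 10 + 2 * toy$x1 + 3 * (toy$city == "C") + rnorm(n)

full    <- lm(y ~ x1 + x2 + city, data = toy)
reduced <- lm(y ~ x1 + x2, data = toy)

# Adjusted R-squared of each model
adj_r2 <- c(full    = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)

# F test: does adding city improve the fit significantly?
cmp <- anova(reduced, full)
print(adj_r2)
print(cmp)
```

A small p-value in the anova() table indicates that the dropped variables carried explanatory power.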

library(leaps)
leap1 <- regsubsets(Model2, data = sale.df, nbest=1)
summary(leap1)
## Subset selection object
## Call: regsubsets.formula(Model2, data = sale.df, nbest = 1)
## 5 Variables  (and intercept)
##                            Forced in Forced out
## Stay_In_Current_City_Years     FALSE      FALSE
## Product_Category_1             FALSE      FALSE
## Marital_StatusMARRIED          FALSE      FALSE
## Product_Category_2             FALSE      FALSE
## Product_Category_3             FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
##          Stay_In_Current_City_Years Product_Category_1
## 1  ( 1 ) " "                        "*"               
## 2  ( 1 ) " "                        "*"               
## 3  ( 1 ) " "                        "*"               
## 4  ( 1 ) " "                        "*"               
## 5  ( 1 ) "*"                        "*"               
##          Marital_StatusMARRIED Product_Category_2 Product_Category_3
## 1  ( 1 ) " "                   " "                " "               
## 2  ( 1 ) " "                   " "                "*"               
## 3  ( 1 ) " "                   "*"                "*"               
## 4  ( 1 ) "*"                   "*"                "*"               
## 5  ( 1 ) "*"                   "*"                "*"
plot(leap1, scale="adjr2")

Coefficients

coefficients(fit12)
##                (Intercept) Stay_In_Current_City_Years 
##                12706.18347                   23.85871 
##         Product_Category_1      Marital_StatusMARRIED 
##                 -840.72818                   79.65808 
##         Product_Category_2         Product_Category_3 
##                   26.87784                   76.65997

Above are the beta coefficients for the respective variables.
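
As a worked example of how these coefficients combine into a prediction, the sketch below multiplies them by one hypothetical input row. The coefficient values are copied from the output above; the input values (stay code 2, product categories 1, 2 and 3, married) are invented for illustration:

```r
# Coefficients as reported by coefficients(fit12) above
beta <- c(
  "(Intercept)"              = 12706.18347,
  Stay_In_Current_City_Years = 23.85871,
  Product_Category_1         = -840.72818,
  Marital_StatusMARRIED      = 79.65808,
  Product_Category_2         = 26.87784,
  Product_Category_3         = 76.65997
)

# Hypothetical input row, in the same order; the leading 1 multiplies the intercept
x <- c(1, 2, 1, 1, 2, 3)

# Linear prediction: B0 + B1*x1 + ... + B5*x5
predicted_purchase <- sum(beta * x)
print(predicted_purchase)
```

With the fitted object available, predict(fit12, newdata = ...) performs the same calculation.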

Results

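We selected Model 2 as the final model. It is significant overall (F-test p-value < 2.2e-16) with an adjusted R-squared of 0.164, slightly below the 0.1688 of Model 1. Purchase amount depends mostly on product category and, to a lesser extent, on the marital status of the customer.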

Insights

1. Purchase amount is higher for males in all but a few categories. Those categories may contain items used more by females than by males; otherwise males should be our main focus.

2. People aged 26-35 spend more than any other age group, followed by the 36-45 group. Almost the same trend holds for all product categories, so we should focus on these two age groups along with the 18-25 group.

3. People who are not married spend more than people who are married, especially males. Female purchases change more after marriage than male purchases.

4. Purchase amount for city category B is higher than for the other city categories.

Conclusion:

This paper was motivated by the need for research that could improve our understanding of customer behavior regarding various products and factors. Its contribution is an investigation of customers' purchase amounts at a store and of their behavior. We found that purchase amount is higher for males in all but a few categories. We observed that people who are not married spend more than people who are married, especially males. Female purchases change more after marriage than male purchases.