RQ: How does the total expenditure per transaction depend on the quantity of products purchased, the price per unit, and the customer's gender?

Follow-up question: Does adding a new variable, the product category (Electronics, Clothing, or Beauty), make any significant difference in explaining the total expenditure of the transaction?

mydataaa <- read.table("~/hw17/retail_sales_dataset_manipulatedd.csv",
                       header = TRUE,
                       sep = ";",
                       dec = ".") #Reading in the dataset
head(mydataaa[, -c(2,5,9,10,11,12)])
##   Transaction.ID Customer.ID Gender Product.Category Quantity Price.per.Unit
## 1              1     CUST002 Female         Clothing        2            500
## 2              2     CUST006 Female           Beauty        1             30
## 3              3     CUST010 Female         Clothing        4             50
## 4              4     CUST015 Female      Electronics        4            500
## 5              5     CUST017 Female         Clothing        4             25
## 6              6     CUST018 Female      Electronics        2             25
##   Total
## 1  2400
## 2    72
## 3   480
## 4  4800
## 5   240
## 6   120

Explanation of dataset: This dataset contains recorded retail information, including the most important drivers of retail operations: Transaction ID, Customer ID, Gender, Product Category, Quantity, and Total Amount. These variables help with predicting sales trends, demographic influences, and purchasing behaviors.

The data has 1000 observations of 8 variables.

The dataset comes from the Kaggle website:

Retail Sales Dataset. Kaggle. https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset

The unit of observation is a single transaction (each row has its own Transaction ID).

This dataset is fairly large (1000 observations), contains many outliers, and has many variables that are irrelevant to the research question. I will therefore draw a random sample (leaving fewer outliers to remove), drop the irrelevant variables, and describe the cleaned data below.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(1)
mydataa <- mydataaa %>% sample_n(size = 500, replace = FALSE) #randomly sampling 500 units from the data
mydata <- mydataa[, -c(2,3,5,9,10,11,12,14,15,16)] #removing the irrelevant variables
mydata$Gender <- factor(mydata$Gender,
                         levels = c("Male", "Female"),
                         labels = c("Male", "Female"))
mydata$Product.Category <- factor(mydata$Product.Category,
                         levels = c("Beauty", "Electronics", "Clothing"),
                         labels = c("Beauty", "Electronics", "Clothing"))

Converting the categorical variables into factors.

any(is.na(mydata))
## [1] FALSE

There are no missing values (NAs).
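If there were missing values (hypothetically, since there are none here), a minimal way to handle them would be listwise deletion:

mydata <- na.omit(mydata) #drops every row containing at least one NA (a no-op here)
colSums(is.na(mydata)) #per-variable count of remaining NAs (all zero)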

summary(mydata[c("Total", "Quantity","Product.Category", "Price.per.Unit", "Gender")])
##      Total           Quantity       Product.Category Price.per.Unit 
##  Min.   :  25.0   Min.   :1.00   Beauty     :145     Min.   : 25.0  
##  1st Qu.: 100.0   1st Qu.:1.00   Electronics:179     1st Qu.: 30.0  
##  Median : 240.0   Median :2.00   Clothing   :176     Median : 50.0  
##  Mean   : 724.7   Mean   :2.47                       Mean   :170.5  
##  3rd Qu.: 900.0   3rd Qu.:3.00                       3rd Qu.:300.0  
##  Max.   :4800.0   Max.   :4.00                       Max.   :500.0  
##     Gender   
##  Male  :250  
##  Female:250  
##              
##              
##              
## 

There are exactly 250 males and exactly 250 females in this random sample.

Out of the three product categories, 145 customers purchased Beauty products, 179 purchased Electronics, and 176 purchased Clothing.

Customers on average purchased 2.47 products per order, though the median of 2 products per order is perhaps the more informative summary here.

The average price per unit purchased was 170.5 monetary units, with the maximum price being 500 and the minimum 25 monetary units. 25% of the units purchased cost 30 monetary units or less, and 75% cost 300 or less; the median price was 50 monetary units.

Customers spent on average 724.7 monetary units per order. 25% of customers spent 100 monetary units or less, with the minimum spending per transaction being 25 monetary units; 75% spent 900 or less, with the maximum spending being 4800 monetary units.
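As a quick sanity check on these quartile readings (a sketch, not a required step of the analysis), they can be pulled out directly:

quantile(mydata$Total) #min, quartiles, and max of total spending
quantile(mydata$Price.per.Unit) #confirms 75% of unit prices are at or below 300
median(mydata$Quantity) #the typical order size of 2 products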

Regression Analysis

Assumptions and requirements:

- The dependent variable is numerical and the explanatory variables are numerical or dichotomous (categorical variables enter as dummy variables). This requirement is fulfilled.
- Each explanatory variable must vary. Every variable here has nonzero variance, so this requirement is met as well (a quick check is sketched below).
- Outliers and units of high impact must be removed from the data (done later in this analysis).
- Multicollinearity must not be too strong.
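A minimal sketch of the nonzero-variance check:

sapply(mydata[c("Quantity", "Price.per.Unit")], var) #both variances are positive
table(mydata$Gender) #both levels occur
table(mydata$Product.Category) #all three levels occur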

Let’s fit the first model and check assumptions and requirements.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(mydata[ , c(-1,-2,-3,-7,-8)], #Scatterplot matrix
                  smooth = FALSE) 

From the graph and the feedback on my last homework, I am not sure whether linearity is met. From the scatterplots among the explanatory variables I assume that multicollinearity is not a problem, but I will also check it with the variance inflation factor (VIF).

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata[ , c(-1, -2,-3,-7,-8)])) #Correlation matrix
##                Quantity Price.per.Unit Total
## Quantity           1.00           0.04  0.35
## Price.per.Unit     0.04           1.00  0.76
## Total              0.35           0.76  1.00
## 
## n= 500 
## 
## 
## P
##                Quantity Price.per.Unit Total 
## Quantity                0.3304         0.0000
## Price.per.Unit 0.3304                  0.0000
## Total          0.0000   0.0000

The correlation between Price per Unit and Quantity is weak and not statistically significant (r = 0.04, p = 0.33), so multicollinearity should not be an issue; let's nevertheless check it with the VIF:

library(car)
fit1 <- lm(Total ~ Quantity +  Price.per.Unit + Product.Category +Gender,
           data = mydata)

vif(fit1)
##                      GVIF Df GVIF^(1/(2*Df))
## Quantity         1.005027  1        1.002511
## Price.per.Unit   1.002417  1        1.001208
## Product.Category 1.000857  2        1.000214
## Gender           1.003953  1        1.001975
mean(vif(fit1))
## [1] 1.084847

As predicted, all the values are below 5, and the mean of the variance inflation factors is close to one.

The no-perfect-multicollinearity assumption is met, as is the requirement of no overly strong multicollinearity.
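One caveat (a sketch): mean(vif(fit1)) above averages every cell of the GVIF table, including the Df column; averaging only the adjusted GVIF column is closer to what is intended.

mean(vif(fit1)[, "GVIF^(1/(2*Df))"]) #mean of the adjusted GVIF values only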

mydata$stdResid <- round(rstandard(fit1), 3) #standardized residuals
mydata$CooksD <- round(cooks.distance(fit1), 3) #Cook's distances

hist (mydata$stdResid,
      xlab = "Standardized residuals",
      ylab = "Frequency",
      main = "Histogram of standardized residuals")

Let’s check whether they are normally distributed with the Shapiro-Wilk normality test, even though this is not critical given the large sample size (500).

shapiro.test(mydata$stdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$stdResid
## W = 0.91288, p-value = 2.339e-16

H0: Errors are normally distributed.

H1: Errors are not normally distributed.

We reject the null hypothesis (p < 0.001) and conclude that the errors are not normally distributed. However, this is not critical: with 500 observations the sample is large enough for the central limit theorem to make the coefficient tests robust to non-normal errors.
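As an additional visual check (not a required step), a normal Q-Q plot compares the standardized residuals with the normal distribution:

qqnorm(mydata$stdResid, main = "Normal Q-Q plot of standardized residuals")
qqline(mydata$stdResid, col = "red") #reference line under normality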

Also, the histogram of standardized residuals shows some outliers, since several values are larger than 3. We will filter those units out later.
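A quick count of the residuals beyond the usual |3| threshold (a minimal sketch):

sum(mydata$stdResid > 3) #residuals above +3
sum(mydata$stdResid < -3) #residuals below -3 (none, as the ordering below confirms)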

Let’s check for influential units with Cook’s distances:

hist(mydata$CooksD,
     xlab = "Cook's distances",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")

There are no values above 1, but I can see some gaps, so there are probably some units with high impact, which I will remove:

head(mydata[order(mydata$stdResid), c("Transaction.ID","stdResid")],3) #Lowest values of standardized residuals
##     Transaction.ID stdResid
## 80             843   -1.846
## 235            955   -1.846
## 262            553   -1.846

These look fine, since they are all within the ±3 band.

head(mydata[order(-mydata$stdResid), c("Transaction.ID", "stdResid")], 11)  #Highest values of standardized residuals
##     Transaction.ID stdResid
## 22              37    3.971
## 431             69    3.971
## 11             307    3.955
## 168            247    3.955
## 214            140    3.955
## 119            290    3.886
## 135            214    3.886
## 155            327    3.886
## 215            126    3.886
## 237            309    3.886
## 91              29    2.151

The first 10 units have standardized residuals above 3, so I will remove them.

head(mydata[order(-mydata$CooksD), c("Transaction.ID", "CooksD")], 11) #Highest Cook's distances
##     Transaction.ID CooksD
## 11             307  0.049
## 168            247  0.049
## 214            140  0.049
## 22              37  0.046
## 431             69  0.046
## 119            290  0.044
## 135            214  0.044
## 155            327  0.044
## 215            126  0.044
## 237            309  0.044
## 351             25  0.012

There is a big gap between the 10th and 11th units’ Cook’s distances. Units with Transaction IDs 307, 247, 140, 37, 69, 290, 214, 327, 126, and 309 are going out.

mydata1 <- mydata %>%
  filter(!Transaction.ID %in% c(307,247,140,37,69,290,214,327,126,309)) #dropping the 10 influential units


head(mydata1[order(-mydata1$stdResid),], 3) #largest remaining standardized residuals
##     Transaction.ID Gender Product.Category Quantity Price.per.Unit Total
## 89              29 Female      Electronics        3            500  3600
## 114            316 Female      Electronics        3            500  3600
## 176             65 Female      Electronics        3            500  3600
##     stdResid CooksD
## 89     2.151  0.011
## 114    2.151  0.011
## 176    2.151  0.011

The dataset now contains 490 units, with the 10 influential ones removed.

library(car)
scatterplotMatrix(mydata1[ , c(-1,-2,-3,-7,-8,-9)], #Scatterplot matrix
                  smooth = FALSE) 

Everything looks essentially the same as before the units were removed.

fit0 <- lm(Total ~ Quantity + Price.per.Unit + Product.Category + Gender,
           data = mydata1)

mydata1$stdFitted <- scale(fit0$fitted.values) #standardized fitted values

library(car)
scatterplot(stdResid ~ stdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            smooth = FALSE,
            regLine = FALSE,
            boxplots = FALSE,
            data=mydata1)

I can see heteroscedasticity (homoscedasticity being violated), but let’s also check it with the Breusch-Pagan test:

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit0)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Total 
##  Variables: fitted values of Total 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    212.5306 
##  Prob > Chi2   =    3.852833e-48

H0: The variance is constant (homoskedasticity).

H1: The variance is not constant (heteroskedasticity).

We reject H0 (p < 0.001) and conclude that the constant-variance assumption is violated. This means I will need to use robust standard errors, because the usual OLS standard errors are biased under heteroskedasticity.
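An equivalent route to robust inference, assuming the lmtest and sandwich packages are installed, keeps the OLS fit and swaps in an HC1 covariance matrix:

library(lmtest)
library(sandwich)
coeftest(fit0, vcov = vcovHC(fit0, type = "HC1")) #robust t-tests for fit0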

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
ggplot(mydata1, aes(y = Total, x = Quantity)) +
  geom_point(colour = "darkred") +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, color = "darkorange") +
  ylab("Total Amount of money spent per order") + 
  xlab("Quantity of products purchased") +
  
  theme_dark() +
  geom_text(
    label = mydata1$Transaction.ID, 
    nudge_x = 0.50, nudge_y = 0.3, 
    check_overlap = T, size = 2.5) +
  geom_hline(yintercept = mean(mydata1$Total), linetype = "dashed", colour = "blue") +
  geom_vline(xintercept = mean(mydata1$Quantity), linetype = "dashed", colour = "blue") 

Here I just wanted to show how much of the difference between an observed value and the mean is explained versus unexplained, for the independent variable Quantity and the dependent variable Total value of the purchase. For example, for unit #597, the vertical distance between the red dot and the orange line is the unexplained part, and the distance between the blue line and the orange line is the explained part. Combined, the vertical distance between the blue line and the red dot is the unit’s entire deviation. This is just for better visualization; it is not one of the steps of the regression analysis.
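To make the decomposition concrete, here is a minimal sketch using the simple regression of Total on Quantity that the plot displays; the chosen row is purely illustrative:

fit_q <- lm(Total ~ Quantity, data = mydata1) #the orange line in the plot
i <- 1 #any row will do; row 1 is illustrative
explained <- fitted(fit_q)[i] - mean(mydata1$Total) #orange line vs. blue line
unexplained <- residuals(fit_q)[i] #red dot vs. orange line
explained + unexplained #equals the unit's total deviation from mean(Total)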

library(ggplot2)
ggplot(mydata1, aes(y = Total, x = Price.per.Unit)) +
  geom_point(colour = "darkred") +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, color = "darkorange") +
  ylab("Total Amount of money spent per order") + 
  xlab("Price per unit") +
  
  theme_dark() +
  geom_text(
    label = mydata1$Price.per.Unit, 
    nudge_x = 0.50, nudge_y = 0.3, 
    check_overlap = T, size = 2.5) +
  geom_hline(yintercept = mean(mydata1$Total), linetype = "dashed", colour = "blue") +
  geom_vline(xintercept = mean(mydata1$Price.per.Unit), linetype = "dashed", colour = "blue") 

This is just for better visualization; it is not one of the steps of the regression analysis.

library(ggplot2)
ggplot(mydata1, aes(y = Total, x = Quantity, color = Gender)) +
  geom_point(aes(shape = Gender)) +
  scale_color_manual(values = c("Male" = "darkorange", "Female" = "darkred")) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  ylab("Total amount") + 
  xlab("Quantity") +
  
  theme_bw() +
  geom_text(
    label = mydata1$Transaction.ID, 
    nudge_x = 0.50, nudge_y = 0.3, 
    check_overlap = T, size = 2.5) +
  labs(color = "Gender", shape = "Gender")

Here I just wanted to show how much more Females spend per purchase, on average, than Males (we capture this with a dummy variable); only Quantity is plotted here, not Price per Unit. This is just for better visualization; it is not one of the steps of the regression analysis.

Let’s continue the regression analysis:

fit1 <- lm(Total ~ Quantity + Price.per.Unit +  Gender ,
           data = mydata1)

summary(fit1)
## 
## Call:
## lm(formula = Total ~ Quantity + Price.per.Unit + Gender, data = mydata1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -839.7 -293.3 -120.5  243.4 1344.5 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -740.9989    51.7627  -14.31   <2e-16 ***
## Quantity        223.0683    16.8488   13.24   <2e-16 ***
## Price.per.Unit    3.7152     0.1054   35.26   <2e-16 ***
## GenderFemale    469.6378    37.6797   12.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 416.2 on 486 degrees of freedom
## Multiple R-squared:  0.7594, Adjusted R-squared:  0.7579 
## F-statistic: 511.4 on 3 and 486 DF,  p-value: < 2.2e-16

We have to use robust standard errors to get valid inference, since homoscedasticity is violated:

library(estimatr)

fit5 <- lm_robust(Total ~ Quantity + Price.per.Unit + Gender ,
                  se_type="HC1",
                  data = mydata1)
summary(fit5)
## 
## Call:
## lm_robust(formula = Total ~ Quantity + Price.per.Unit + Gender, 
##     data = mydata1, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
## (Intercept)    -740.999    59.0754  -12.54 1.837e-31 -857.074 -624.924 486
## Quantity        223.068    17.7123   12.59 1.132e-31  188.266  257.870 486
## Price.per.Unit    3.715     0.1416   26.24 3.732e-95    3.437    3.993 486
## GenderFemale    469.638    38.3170   12.26 2.781e-30  394.350  544.925 486
## 
## Multiple R-squared:  0.7594 ,    Adjusted R-squared:  0.7579 
## F-statistic:   315 on 3 and 486 DF,  p-value: < 2.2e-16

H0: βi = 0

H1: βi ≠ 0

With all p-values < 0.001, we reject the null hypotheses. We have enough statistical evidence that each actual regression coefficient βi differs from 0.

H1: β1 ≠ 0. We reject H0 at p < 0.001: the quantity of products purchased has a significant effect on the total value of the purchase. If the quantity of products purchased increases by one unit (product), the total value of the purchase increases on average by 223.07 monetary units, assuming all other explanatory variables remain the same.

H1: β2 ≠ 0. We reject H0 at p < 0.001: the price per unit of the products purchased has a significant effect on the total value of the purchase. If the price per unit increases by one monetary unit, the total value of the purchase increases on average by 3.72 monetary units, assuming all other explanatory variables remain the same.

H1: β3 ≠ 0. We reject H0 at p < 0.001: gender has a significant effect on the total value of the purchase. Given the values of all other explanatory variables, females on average spend 469.64 monetary units more per purchase than males.

Coefficient of determination: R² = 0.7594, i.e. 75.94% of the variability of the total value of the purchase is explained by the variability (the linear effect) of the explanatory variables included in this model (Quantity, Price per Unit, and Gender).
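Equivalently (a quick sketch), R² is the squared correlation between the observed and fitted values:

cor(mydata1$Total, fitted(fit1))^2 #matches Multiple R-squared = 0.7594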

Test of significance of the regression, F-statistic (ANOVA):

H0: ρ² = 0, i.e. β1 = β2 = β3 = 0

H1: ρ² > 0, i.e. at least one βj differs from 0

We reject the null hypothesis at p < 0.001 and conclude that at least one of the explanatory variables affects the total value of a purchase.

fit3 <- lm_robust(Total ~ Quantity + Price.per.Unit + Product.Category + Gender ,
                  se_type="HC1",
                  data = mydata1)
summary(fit3)
## 
## Call:
## lm_robust(formula = Total ~ Quantity + Price.per.Unit + Product.Category + 
##     Gender, data = mydata1, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                             Estimate Std. Error  t value  Pr(>|t|) CI Lower
## (Intercept)                 -752.573    67.6124 -11.1307 8.867e-26 -885.423
## Quantity                     223.074    17.7345  12.5785 1.362e-31  188.227
## Price.per.Unit                 3.715     0.1417  26.2248 5.619e-95    3.437
## Product.CategoryElectronics   15.015    45.9296   0.3269 7.439e-01  -75.231
## Product.CategoryClothing      17.921    43.6075   0.4110 6.813e-01  -67.763
## GenderFemale                 469.394    38.3479  12.2404 3.351e-30  394.045
##                             CI Upper  DF
## (Intercept)                 -619.723 484
## Quantity                     257.920 484
## Price.per.Unit                 3.994 484
## Product.CategoryElectronics  105.261 484
## Product.CategoryClothing     103.604 484
## GenderFemale                 544.743 484
## 
## Multiple R-squared:  0.7595 ,    Adjusted R-squared:  0.757 
## F-statistic: 193.7 on 5 and 484 DF,  p-value: < 2.2e-16

Adding the Product Category variable does not improve the model: neither category coefficient is statistically significant (p = 0.74 for Electronics, p = 0.68 for Clothing), and the share of explained variability is essentially unchanged (R² = 0.7595 vs. 0.7594). We therefore do not include it in our regression model and keep the simpler one without it.
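A formal version of this check (a sketch, assuming the sandwich package loaded above; car is already attached) is a joint Wald test that both Product.Category coefficients are zero, using the same HC1 robust covariance:

fit3_ols <- lm(Total ~ Quantity + Price.per.Unit + Product.Category + Gender,
               data = mydata1)
linearHypothesis(fit3_ols,
                 c("Product.CategoryElectronics = 0",
                   "Product.CategoryClothing = 0"),
                 vcov = vcovHC(fit3_ols, type = "HC1")) #joint test of the two category dummies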

sqrt(summary(fit5)$r.squared)
## [1] 0.8714526

The multiple correlation coefficient, the square root of R², tells us the strength of the relationship between the dependent variable and all the explanatory variables in the model (the simpler one, without product category).

The relationship between the dependent variable (total value of the purchase) and the explanatory variables (quantity, price per unit, and gender) is strong (R ≈ 0.87).

We do not attach a sign (positive/negative) to R here, since it summarizes a relationship among four variables rather than a single direction.