Metka Pintar

RQ1: Can differences in the total amount of a transaction among customers be explained by the quantity of products, product category, price per unit and customer’s age?

1. Data

Source of the data: https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset/data
#Showing first 6 rows of the data
mydata <- read.table("./retail_sales_dataset.csv", header=TRUE, sep = ",", dec = ",") #Reading the data
head(mydata)
##   Transaction.ID       Date Customer.ID Gender Age Product.Category
## 1              1 2023-11-24     CUST001   Male  34           Beauty
## 2              2 2023-02-27     CUST002 Female  26         Clothing
## 3              3 2023-01-13     CUST003   Male  50      Electronics
## 4              4 2023-05-21     CUST004   Male  37         Clothing
## 5              5 2023-05-06     CUST005   Male  30           Beauty
## 6              6 2023-04-25     CUST006 Female  45           Beauty
##   Quantity Price.per.Unit Total.Amount
## 1        3             50          150
## 2        2            500         1000
## 3        1             30           30
## 4        1            500          500
## 5        2             50          100
## 6        1             30           30

Unit of observation: 1 customer (not specified where, when; crafted data)

Sample size: 1000

Definitions and units of variables:

  • Transaction.ID: A unique identifier for each transaction.
  • Date: The date when the transaction occurred (year, month, day).
  • Customer ID: A unique identifier for each customer.
  • Gender: The gender of the customer (Male,Female).
  • Age: The age of the customer (years).
  • Product.Category:The category of the purchased product (Electronics, Clothing, Beauty).
  • Quantity: The number of units of the product purchased.
  • Price.per.Unit: The price of one unit of the product(money unit is not specified).
  • Total.Amount: The total monetary value of the transaction (money unit is not specified).
#Including only 5th, 6th, 7th, 8th and 9th column (only needed data)
mydata4 <- mydata[, c(5,6,7,8,9)] 
head(mydata4)
##   Age Product.Category Quantity Price.per.Unit Total.Amount
## 1  34           Beauty        3             50          150
## 2  26         Clothing        2            500         1000
## 3  50      Electronics        1             30           30
## 4  37         Clothing        1            500          500
## 5  30           Beauty        2             50          100
## 6  45           Beauty        1             30           30
#Creating a factor variable for categorical variable (product category)  
mydata4$Product.CategoryFactor <- factor(mydata4$Product.Category, 
                                levels = c("Clothing", "Beauty", "Electronics"), 
                                labels = c("Clothing", "Beauty", "Electronics")) 

   
head(mydata4, 6)
##   Age Product.Category Quantity Price.per.Unit Total.Amount
## 1  34           Beauty        3             50          150
## 2  26         Clothing        2            500         1000
## 3  50      Electronics        1             30           30
## 4  37         Clothing        1            500          500
## 5  30           Beauty        2             50          100
## 6  45           Beauty        1             30           30
##   Product.CategoryFactor
## 1                 Beauty
## 2               Clothing
## 3            Electronics
## 4               Clothing
## 5                 Beauty
## 6                 Beauty
#Showing the descriptive statistics
summary(mydata4) 
##       Age        Product.Category      Quantity     Price.per.Unit 
##  Min.   :18.00   Length:1000        Min.   :1.000   Min.   : 25.0  
##  1st Qu.:29.00   Class :character   1st Qu.:1.000   1st Qu.: 30.0  
##  Median :42.00   Mode  :character   Median :3.000   Median : 50.0  
##  Mean   :41.39                      Mean   :2.514   Mean   :179.9  
##  3rd Qu.:53.00                      3rd Qu.:4.000   3rd Qu.:300.0  
##  Max.   :64.00                      Max.   :4.000   Max.   :500.0  
##   Total.Amount  Product.CategoryFactor
##  Min.   :  25   Clothing   :351       
##  1st Qu.:  60   Beauty     :307       
##  Median : 135   Electronics:342       
##  Mean   : 456                         
##  3rd Qu.: 900                         
##  Max.   :2000

Explanation of the descriptive statistics:

  • Max Age = 64; The oldest customer was 64 years old.
  • Median Quantity = 3; Half of the customers bought up to 3 products, the other half bought more than 3 products.
  • Mean Total.Amount = 456; The average total monetary value of the transaction is 456 monetary units.
  • 3rd quantile Price.per.Unit = 300; 75% of the products has a price per unit up to 300 units, the price per unit of the other 25% of the products is higher than 300 units.

2. Analysis

Regression function: Total Amount = b0 + b1Age + b2Quantity + b3Price.per.Unit + b4Product.Category(Dummy)

Checking assumptions and additional requirements:

Assumption 1: Linearity in parameters

#Checking linearity 
library(car)
## Loading required package: carData
scatterplotMatrix(mydata4[, c(5,1,3,4)],
                  smooth = FALSE) 

From the scatterplot we can clearly see that there is no problem with linearity. Assumption 1 is met. But we can see that we have problem with hetersoskedasticity for variables Quantity and Price.per.Unit. Will check it later with the test.

Assumption 2: The expected value of the errors equals 0.

I built the model using regression and I included all the important variables in the model. Assumption 2 is met.

Expected effects of explanatory variables on the dependent variable:

  • Age: Positive correlation. Younger people have low income and can afford less. Higher the age, higher the disposable income and greater the affordability, so the total amount of a transaction is higher for older people.
  • Product category: Electronics might contribute more to the total amount compared to beauty products and clothing.
  • Quantity: Positive correlation; as the quantity increases, the total transaction amount also increases.
  • Price per unit: Positive correlation; higher the price, higher the total amount of a transaction.

Additional requirement #1: Non-zero varience of explanatory variables

From the scatterplot we can see that explanatory variables has a wide range of different values. Additional requirement #1 is met.

Additional requirement #2: The dependent variable has to be numerical, the explanatory variables can be numerical or dichotomous

  • Dependent variable is the total amount of a transaction and has monetary units -> numerical variable
  • Explanatory variables age, price per unit and quantity are numerical variables, variable product category is a dummy variable. Adittional requirement #2 is met.

Assumption 3: No perfect multicolinearity + Additional requirement #3: Absence of too strong multicolinearity

#Checking for multicollinearity
fit <- lm(Total.Amount ~ Age + Price.per.Unit + Quantity + Product.CategoryFactor, 
           data = mydata4)

vif(fit)
##                            GVIF Df GVIF^(1/(2*Df))
## Age                    1.004486  1        1.002241
## Price.per.Unit         1.002204  1        1.001102
## Quantity               1.001433  1        1.000716
## Product.CategoryFactor 1.003546  2        1.000885

VIF of Age, Price per unit and Quantity is <5 and VIF for Product category <√5. We have no issues with multicollinearity. Assumption 3 and additional requirement #3 are met.

Assumption 4: The number of units is greater than the number of estimated parameters (n > k)

  • There are 200 units in our sample (n = 200) and we have 5 parameters (k = 5) in our regression model. Assumption 4 is met.

Additional requirement #4: No outliers and high-impact units

#Checking for outliers and units with high influence.
mydata4$StdResid <- round(rstandard(fit), 3) 
mydata4$CooksD <- round(cooks.distance(fit), 3)

hist(mydata4$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

All values are between -3 and +3, there are no outliers.

hist(mydata4$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cook's distances")

The Cook’s distances are below 1, but there is one gap. I have to remove units with high impact.

#Identifying and excluding units with Cook's distance above 0.004
outliers <- which(mydata4$CooksD > 0.004) 
filtered_data <- mydata4[-outliers, ] 

#Refit the model without the units with high impact
fit1 <- lm(Total.Amount ~ Age + Price.per.Unit + Quantity + Product.CategoryFactor, 
           data = mydata4) 

#Checking the histogram again if there are any more units with high impact
hist(filtered_data$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cook's distances")

We still have 2 gaps. We have to remove the units.

#Identifying and excluding units with Cook's distance above 0.0005
outliers <- which(mydata4$CooksD > 0.0005) 
filtered_data1 <- mydata4[-outliers, ] 

#Refit the model without the units with high impact
fit2 <- lm(Total.Amount ~ Age + Price.per.Unit + Quantity + Product.CategoryFactor, 
           data = filtered_data1) 

#Checking the histogram again
hist(filtered_data1$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cook's distances")

We do not have any outliers, high-impact units were removed. Additional requirement #4 is met.

Assumption 5: Normal distribution of errors.

#Checking for normality
shapiro.test(filtered_data1$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  filtered_data1$StdResid
## W = 0.78783, p-value < 2.2e-16
  • H0: Errors are normally distributed

  • H1: Errors are not normally distributed

We can reject H0 (p<0.001). We can not assume normality of st. residuals. Considering we have a large sample of 1000 units, we don’t need to worry about that. Assumption 5 is met.

Assumption 6: Errors are independent

Same unit was observed only once, we do not have panel data. Assumption 6 is met.

Assumption 7: Homoskedasticity

#Checking for homoskedasticity

filtered_data1$StdFitted <- scale(fit2$fitted.values)
filtered_data1$StdResid <- round(rstandard(fit2), 3)

library(car)
scatterplot(y = filtered_data1$StdResid, x = filtered_data1$StdFitted,  
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            smooth = FALSE,
            regLine = FALSE,
            boxplots = FALSE)

As mentioned earlier and as we can also see from the graph, we are dealing with heteroskedasticity. Let’s confirm it with a test.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                   Data                   
##  ----------------------------------------
##  Response : Total.Amount 
##  Variables: fitted values of Total.Amount 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    219.2836 
##  Prob > Chi2   =    1.296152e-49
  • H0: Homoskedasticity is met.
  • H1: Homoskedasticity is violated.

Based on the sample data we reject H0 (p<0.001) and conclude that heteroscedasticity is present (Assumption 7 is not met). Because of that we should use White’s robust standard errors.

#Regression with White's robust standard errors
library(estimatr)
fit3 <- lm_robust(Total.Amount ~ Age + Price.per.Unit + Quantity + Product.CategoryFactor, 
           se_type = "HC1",
           data = filtered_data1)
summary(fit3)
## 
## Call:
## lm_robust(formula = Total.Amount ~ Age + Price.per.Unit + Quantity + 
##     Product.CategoryFactor, data = filtered_data1, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                                   Estimate Std. Error  t value
## (Intercept)                       -254.287   18.62053 -13.6563
## Age                                  0.081    0.20657   0.3921
## Price.per.Unit                       2.593    0.03986  65.0469
## Quantity                            99.239    6.46630  15.3471
## Product.CategoryFactorBeauty        -1.974    7.16868  -0.2754
## Product.CategoryFactorElectronics   10.898    6.89783   1.5800
##                                     Pr(>|t|)  CI Lower  CI Upper  DF
## (Intercept)                        8.280e-35 -290.8993 -217.6744 379
## Age                                6.952e-01   -0.3252    0.4872 379
## Price.per.Unit                    1.023e-207    2.5142    2.6710 379
## Quantity                           1.100e-41   86.5248  111.9534 379
## Product.CategoryFactorBeauty       7.832e-01  -16.0693   12.1215 379
## Product.CategoryFactorElectronics  1.149e-01   -2.6644   24.4612 379
## 
## Multiple R-squared:  0.969 , Adjusted R-squared:  0.9686 
## F-statistic:  1554 on 5 and 379 DF,  p-value: < 2.2e-16

Variables Age and dummy variable Product category are not statistically significant (p value is too high). We can not assume that age and product category have a significant effect on the total amount of a transaction.

Explanations:

  • Quantity coefficient: 99.239; If quantity increases for 1 unit (1 product), the total amount of a transaction on average increases for 99.24 monetary units, assuming that all other variables remain unchanged (p<0.001).

  • Price.per.Unit coefficient: 2.593: If price of a product increases for 1 monetary unit, the total amount of a transaction on average increases for 2.59 monetary units, assuming that all other variables remain unchanged (p<0.001).

  • Multiple R-squared: 0.969; 96.9% of variability of the total amount of a transaction is explained by variability of age, price per unit, quantity and product category.

  • F-statistics:

H0: beta1 = beta2 = beta3 = beta4 = 0

H1: At least one is different than 0.

We can reject H0 (p<0.001), and conclude that at least one of the variables have at least some impact on the total amount of a transaction.

3. Conclusion

Based on the sample data we can conclude that the quantity of items a customer buys and the price per product affects how much they spend in a transaction. However, the age of the customer and the type of a product they buy don’t have a statistically significant effect on the total amount of a transaction.