#Showing first 6 rows of the data
mydata <- read.table("./retail_sales_dataset.csv", header=TRUE, sep = ",", dec = ",") #Reading the data
head(mydata)
## Transaction.ID Date Customer.ID Gender Age Product.Category
## 1 1 2023-11-24 CUST001 Male 34 Beauty
## 2 2 2023-02-27 CUST002 Female 26 Clothing
## 3 3 2023-01-13 CUST003 Male 50 Electronics
## 4 4 2023-05-21 CUST004 Male 37 Clothing
## 5 5 2023-05-06 CUST005 Male 30 Beauty
## 6 6 2023-04-25 CUST006 Female 45 Beauty
## Quantity Price.per.Unit Total.Amount
## 1 3 50 150
## 2 2 500 1000
## 3 1 30 30
## 4 1 500 500
## 5 2 50 100
## 6 1 30 30
Unit of observation: 1 customer (not specified where, when; crafted data)
Sample size: 1000
#Including only 5th, 6th, 7th, 8th and 9th column (only needed data)
mydata4 <- mydata[, c(5,6,7,8,9)]
head(mydata4)
## Age Product.Category Quantity Price.per.Unit Total.Amount
## 1 34 Beauty 3 50 150
## 2 26 Clothing 2 500 1000
## 3 50 Electronics 1 30 30
## 4 37 Clothing 1 500 500
## 5 30 Beauty 2 50 100
## 6 45 Beauty 1 30 30
#Creating a factor variable for categorical variable (product category)
mydata4$Product.CategoryFactor <- factor(mydata4$Product.Category,
levels = c("Clothing", "Beauty", "Electronics"),
labels = c("Clothing", "Beauty", "Electronics"))
head(mydata4, 6)
## Age Product.Category Quantity Price.per.Unit Total.Amount
## 1 34 Beauty 3 50 150
## 2 26 Clothing 2 500 1000
## 3 50 Electronics 1 30 30
## 4 37 Clothing 1 500 500
## 5 30 Beauty 2 50 100
## 6 45 Beauty 1 30 30
## Product.CategoryFactor
## 1 Beauty
## 2 Clothing
## 3 Electronics
## 4 Clothing
## 5 Beauty
## 6 Beauty
#Showing the descriptive statistics
summary(mydata4)
## Age Product.Category Quantity Price.per.Unit
## Min. :18.00 Length:1000 Min. :1.000 Min. : 25.0
## 1st Qu.:29.00 Class :character 1st Qu.:1.000 1st Qu.: 30.0
## Median :42.00 Mode :character Median :3.000 Median : 50.0
## Mean :41.39 Mean :2.514 Mean :179.9
## 3rd Qu.:53.00 3rd Qu.:4.000 3rd Qu.:300.0
## Max. :64.00 Max. :4.000 Max. :500.0
## Total.Amount Product.CategoryFactor
## Min. : 25 Clothing :351
## 1st Qu.: 60 Beauty :307
## Median : 135 Electronics:342
## Mean : 456
## 3rd Qu.: 900
## Max. :2000
Explanation of the descriptive statistics:
Regression function: Total Amount = b0 + b1Age + b2Quantity + b3Price.per.Unit + b4Product.Category(Dummy)
#Checking linearity
library(car)
## Loading required package: carData
scatterplotMatrix(mydata4[, c(5,1,3,4)],
smooth = FALSE)
From the scatterplot we can clearly see that there is no problem with linearity. Assumption 1 is met. But we can see that we have problem with hetersoskedasticity for variables Quantity and Price.per.Unit. Will check it later with the test.
I built the model using regression and I included all the important variables in the model. Assumption 2 is met.
Expected effects of explanatory variables on the dependent variable:
From the scatterplot we can see that explanatory variables has a wide range of different values. Additional requirement #1 is met.
#Checking for multicollinearity
fit <- lm(Total.Amount ~ Age + Price.per.Unit + Quantity + Product.CategoryFactor,
data = mydata4)
vif(fit)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.004486 1 1.002241
## Price.per.Unit 1.002204 1 1.001102
## Quantity 1.001433 1 1.000716
## Product.CategoryFactor 1.003546 2 1.000885
VIF of Age, Price per unit and Quantity is <5 and VIF for Product category <√5. We have no issues with multicollinearity. Assumption 3 and additional requirement #3 are met.
#Checking for outliers and units with high influence.
mydata4$StdResid <- round(rstandard(fit), 3)
mydata4$CooksD <- round(cooks.distance(fit), 3)
hist(mydata4$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
All values are between -3 and +3, there are no outliers.
hist(mydata4$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cook's distances")
The Cook’s distances are below 1, but there is one gap. I have to remove units with high impact.
#Identifying and excluding units with Cook's distance above 0.004
outliers <- which(mydata4$CooksD > 0.004)
filtered_data <- mydata4[-outliers, ]
#Refit the model without the units with high impact
fit1 <- lm(Total.Amount ~ Age + Price.per.Unit + Quantity + Product.CategoryFactor,
data = mydata4)
#Checking the histogram again if there are any more units with high impact
hist(filtered_data$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cook's distances")
We still have 2 gaps. We have to remove the units.
#Identifying and excluding units with Cook's distance above 0.0005
outliers <- which(mydata4$CooksD > 0.0005)
filtered_data1 <- mydata4[-outliers, ]
#Refit the model without the units with high impact
fit2 <- lm(Total.Amount ~ Age + Price.per.Unit + Quantity + Product.CategoryFactor,
data = filtered_data1)
#Checking the histogram again
hist(filtered_data1$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cook's distances")
We do not have any outliers, high-impact units were removed. Additional requirement #4 is met.
#Checking for normality
shapiro.test(filtered_data1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: filtered_data1$StdResid
## W = 0.78783, p-value < 2.2e-16
H0: Errors are normally distributed
H1: Errors are not normally distributed
We can reject H0 (p<0.001). We can not assume normality of st. residuals. Considering we have a large sample of 1000 units, we don’t need to worry about that. Assumption 5 is met.
Same unit was observed only once, we do not have panel data. Assumption 6 is met.
#Checking for homoskedasticity
filtered_data1$StdFitted <- scale(fit2$fitted.values)
filtered_data1$StdResid <- round(rstandard(fit2), 3)
library(car)
scatterplot(y = filtered_data1$StdResid, x = filtered_data1$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
smooth = FALSE,
regLine = FALSE,
boxplots = FALSE)
As mentioned earlier and as we can also see from the graph, we are dealing with heteroskedasticity. Let’s confirm it with a test.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ----------------------------------------
## Response : Total.Amount
## Variables: fitted values of Total.Amount
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 219.2836
## Prob > Chi2 = 1.296152e-49
Based on the sample data we reject H0 (p<0.001) and conclude that heteroscedasticity is present (Assumption 7 is not met). Because of that we should use White’s robust standard errors.
#Regression with White's robust standard errors
library(estimatr)
fit3 <- lm_robust(Total.Amount ~ Age + Price.per.Unit + Quantity + Product.CategoryFactor,
se_type = "HC1",
data = filtered_data1)
summary(fit3)
##
## Call:
## lm_robust(formula = Total.Amount ~ Age + Price.per.Unit + Quantity +
## Product.CategoryFactor, data = filtered_data1, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -254.287 18.62053 -13.6563
## Age 0.081 0.20657 0.3921
## Price.per.Unit 2.593 0.03986 65.0469
## Quantity 99.239 6.46630 15.3471
## Product.CategoryFactorBeauty -1.974 7.16868 -0.2754
## Product.CategoryFactorElectronics 10.898 6.89783 1.5800
## Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 8.280e-35 -290.8993 -217.6744 379
## Age 6.952e-01 -0.3252 0.4872 379
## Price.per.Unit 1.023e-207 2.5142 2.6710 379
## Quantity 1.100e-41 86.5248 111.9534 379
## Product.CategoryFactorBeauty 7.832e-01 -16.0693 12.1215 379
## Product.CategoryFactorElectronics 1.149e-01 -2.6644 24.4612 379
##
## Multiple R-squared: 0.969 , Adjusted R-squared: 0.9686
## F-statistic: 1554 on 5 and 379 DF, p-value: < 2.2e-16
Variables Age and dummy variable Product category are not statistically significant (p value is too high). We can not assume that age and product category have a significant effect on the total amount of a transaction.
Explanations:
Quantity coefficient: 99.239; If quantity increases for 1 unit (1 product), the total amount of a transaction on average increases for 99.24 monetary units, assuming that all other variables remain unchanged (p<0.001).
Price.per.Unit coefficient: 2.593: If price of a product increases for 1 monetary unit, the total amount of a transaction on average increases for 2.59 monetary units, assuming that all other variables remain unchanged (p<0.001).
Multiple R-squared: 0.969; 96.9% of variability of the total amount of a transaction is explained by variability of age, price per unit, quantity and product category.
F-statistics:
H0: beta1 = beta2 = beta3 = beta4 = 0
H1: At least one is different than 0.
We can reject H0 (p<0.001), and conclude that at least one of the variables have at least some impact on the total amount of a transaction.
Based on the sample data we can conclude that the quantity of items a customer buys and the price per product affects how much they spend in a transaction. However, the age of the customer and the type of a product they buy don’t have a statistically significant effect on the total amount of a transaction.