RQ: How does the total expenditure per transaction depend on the quantity of products purchased, the price per unit in the purchase and the gender?
Follow up question: Does adding a new variable-the category of a product (Electronics, Clothing, or Beauty) make any significant difference in explaining the total expenditure of the transaction?
mydataaa <- read.table("~/hw17/retail_sales_dataset_manipulatedd.csv",
header = TRUE,
sep=";",
dec=".") #Creating the dataset
head(mydataaa[, -c(2,5,9,10,11,12)])
## Transaction.ID Customer.ID Gender Product.Category Quantity Price.per.Unit
## 1 1 CUST002 Female Clothing 2 500
## 2 2 CUST006 Female Beauty 1 30
## 3 3 CUST010 Female Clothing 4 50
## 4 4 CUST015 Female Electronics 4 500
## 5 5 CUST017 Female Clothing 4 25
## 6 6 CUST018 Female Electronics 2 25
## Total
## 1 2400
## 2 72
## 3 480
## 4 4800
## 5 240
## 6 120
Explanation of dataset: This data presents retail information recorded, including most important drivers of retail operations, which are Transaction ID, Customer ID, Gender, Product Category, Quantity, and Total Amount. They help with predicting sales trends, demographic influences and purchased behaviors.
The data has 1000 observations with 8 variables, where:
Source of this dataset is from kaggle website:
Ranitsarkar. (2023, March 4). Yulu_Analysis-Hypothesis testing. https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset
Unit of observation is a customer.
This dataset is too big (1000) and many outliers, so I will sample it to have less outliers to remove and has too many variables (irrelevant to the research question), so I am cleaning it and explaining the clean data below.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(1)
mydataa <- mydataaa %>% sample_n(size = 500, replace = FALSE) #random sampling the data to a sample with 500 units.
mydata <- mydataa[, -c(2,3,5,9,10,11,12,14,15,16)] #removing the irrelevant variables
mydata$Gender <- factor(mydata$Gender,
levels = c("Male", "Female"),
labels = c("Male", "Female"))
mydata$Product.Category <- factor(mydata$Product.Category,
levels = c("Beauty", "Electronics", "Clothing"),
labels = c("Beauty", "Electronics", "Clothing"))
Factoring out the categorical variables.
any(is.na(mydata))
## [1] FALSE
There are no N/As.
summary(mydata[c("Total", "Quantity","Product.Category", "Price.per.Unit", "Gender")])
## Total Quantity Product.Category Price.per.Unit
## Min. : 25.0 Min. :1.00 Beauty :145 Min. : 25.0
## 1st Qu.: 100.0 1st Qu.:1.00 Electronics:179 1st Qu.: 30.0
## Median : 240.0 Median :2.00 Clothing :176 Median : 50.0
## Mean : 724.7 Mean :2.47 Mean :170.5
## 3rd Qu.: 900.0 3rd Qu.:3.00 3rd Qu.:300.0
## Max. :4800.0 Max. :4.00 Max. :500.0
## Gender
## Male :250
## Female:250
##
##
##
##
There are eaxctly 250 males and exactly 250 females in this random sample.
Out of three product categories, 145 customers purchased Beauty products, 179 purchased 179 products and 176 purchased 176 products.
Customers on average purchased 2,47 products per order, but maybe more appropriate explanation here is that most customers purchased 2 products per order.
Average price per unit purchased was 170,5 monetary units, with maximum price being 500 and minimum 25 monetary units. 25% of units purchased had a price of less than 30 monetary units, and 75% costed 300 or more monetary units. Most of them costed 50 monetary units.
Customers spent on average 724,7 monetary units per order. 25% of customers spent less than 100 monetary units, with minimum spending per transaction being 25 monetary units. 75% of customers spent at least 900 monetary units with maximum spending being 4800 monetary units.
Assumptions:
Requirements:
The dependent variable is numerical, the explanatory variables can be numerical or dichotomous. This requirement is fulfilled. Each explanatory variable must vary. None of the variables has nonzero variance so this assumption is met as well. Outliers and units of high impact are removed from the data. Not too strong multicolinearity.
Letβs fit the first model and check assumptions and requirements.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata[ , c(-1,-2,-3,-7,-8)], #Scatterplot matrix
smooth = FALSE)
From the graph and the feedback from my last homework, I am not sure if
the linearity is met. From the scatterplots among explanatory variables
I assume that multicolineatity isnβt a problem, but I will aslo check it
with variance inflation factor VIF.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(-1, -2,-3,-7,-8)])) #Correlation matrix
## Quantity Price.per.Unit Total
## Quantity 1.00 0.04 0.35
## Price.per.Unit 0.04 1.00 0.76
## Total 0.35 0.76 1.00
##
## n= 500
##
##
## P
## Quantity Price.per.Unit Total
## Quantity 0.3304 0.0000
## Price.per.Unit 0.3304 0.0000
## Total 0.0000 0.0000
There could be statistically significant correlation between Price per unit and Quantity, so letβs check VIF for the multicolinearity:
library(car)
fit1 <- lm(Total ~ Quantity + Price.per.Unit + Product.Category +Gender,
data = mydata)
vif(fit1)
## GVIF Df GVIF^(1/(2*Df))
## Quantity 1.005027 1 1.002511
## Price.per.Unit 1.002417 1 1.001208
## Product.Category 1.000857 2 1.000214
## Gender 1.003953 1 1.001975
mean(vif(fit1))
## [1] 1.084847
As predicted, all the values are below 5, and the mean of variance inflation factor is close to one.
No perfect multicolineartity assumption is met, as well as the not too strong mulicolinearity requirement.
mydata$stdResid <- round(rstandard(fit1), 3) #standardized residuals
mydata$CooksD <- round(cooks.distance(fit1), 3) #Cooks distances
hist (mydata$stdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
Letβs check if they are normally distributed with SHapiro Wilk normality
test even though it wouldnβt matter as I have big enough (500) sample
size.
shapiro.test(mydata$stdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$stdResid
## W = 0.91288, p-value = 2.339e-16
H0: Errors are normally distributed.
H1: Errors are not normally distributed.
We reject the null hypothesis (p<0.001) and conclude that errors are not normally distributed, however, this does not matter since the sample size is sufficiently big.
Also, from the histogram of standardized residuals, it can be seen that there are some outlier since some values are smaller than -3 and higher than 3. We will filter those units out later.
Letβs check outliers with cooks distances:
hist(mydata$CooksD,
xlab = "Cooks DIstances",
ylab = "Frequency",
main = "Histogram of Cooks distances")
No values above 1, but I can see some gaps, so there are probably some
units with high impact, which I will remove:
head(mydata[order(mydata$stdResid), c("Transaction.ID","stdResid")],3) #Highest values of standardized residuals
## Transaction.ID stdResid
## 80 843 -1.846
## 235 955 -1.846
## 262 553 -1.846
looks ok, because they are all <3
head(mydata[order(-mydata$stdResid), c("Transaction.ID", "stdResid")], 11) #Lowest values of standardized residuals
## Transaction.ID stdResid
## 22 37 3.971
## 431 69 3.971
## 11 307 3.955
## 168 247 3.955
## 214 140 3.955
## 119 290 3.886
## 135 214 3.886
## 155 327 3.886
## 215 126 3.886
## 237 309 3.886
## 91 29 2.151
First 10 units have std res values above 3, so I will remove them
head(mydata[order(-mydata$CooksD), c("Transaction.ID", "CooksD")], 11)
## Transaction.ID CooksD
## 11 307 0.049
## 168 247 0.049
## 214 140 0.049
## 22 37 0.046
## 431 69 0.046
## 119 290 0.044
## 135 214 0.044
## 155 327 0.044
## 215 126 0.044
## 237 309 0.044
## 351 25 0.012
There is a big difference between 10th and 11th unitβs cooks distance. Units with Transaction ID 307,247,140,37,69,290,214,327,126,309 are going out.
mydata1 <- mydata %>%
filter(!Transaction.ID %in% c(307,247,140,37,69,290,214,327,126,309))
head(mydata1[order(-mydata1$stdResid),], 3)
## Transaction.ID Gender Product.Category Quantity Price.per.Unit Total
## 89 29 Female Electronics 3 500 3600
## 114 316 Female Electronics 3 500 3600
## 176 65 Female Electronics 3 500 3600
## stdResid CooksD
## 89 2.151 0.011
## 114 2.151 0.011
## 176 2.151 0.011
Dataset with 10 removed units.
library(car)
scatterplotMatrix(mydata1[ , c(-1,-2,-3,-7,-8,-9)], #Scatterplot matrix
smooth = FALSE)
Everything looks the same as without the removed units.
fit0 <- lm (Total ~ Quantity + Price.per.Unit + Product.Category + Gender,
data = mydata1)
mydata1$stdFitted <- scale(fit0$fitted.values)
library(car)
scatterplot(stdResid ~ stdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
smooth = FALSE,
regLine = FALSE,
boxplots = FALSE,
data=mydata1)
I can see heteroscendasticity (homoscedasticity being violated), but
letβs also check it with Breusch Pagan test:
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit0)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Total
## Variables: fitted values of Total
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 212.5306
## Prob > Chi2 = 3.852833e-48
H0: The variance is constant (homoskedasticity
H1: The variance is not constant - (heteroskedasticity)
We reject the H0 at p>0,001 and conclude the assumption of variance being constant is violated. This means I will need to use robust standard errors, because they are biased.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
ggplot(mydata1, aes(y = Total, x = Quantity)) +
geom_point(colour = "darkred") +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, color = "darkorange") +
ylab("Total Amount of money spent per order") +
xlab("Quantity of products purchased") +
theme_dark() +
geom_text(
label = mydata1$Transaction.ID,
nudge_x = 0.50, nudge_y = 0.3,
check_overlap = T, size = 2.5) +
geom_hline(yintercept = mean(mydata1$Total), linetype = "dashed", colour = "blue") +
geom_vline(xintercept = mean(mydata1$Quantity), linetype = "dashed", colour = "blue")
Here I just wanted to check how much of the differences between
estimated and fitted values is explained and explained for the
independent variable Quantity ofr the dependend variable Total Value of
the purchase. For example, for unit #597, the distance above the orange
line and the red dot is unexplained, and the distance between the blue
line and orange line is explained. Both of them combined-the vertical
distance between the blue line and the red dot is the entire deviation
of this individual unit. This is just for better vizualisation, it is
not one of the steps of the regression analysis.
library(ggplot2)
ggplot(mydata1, aes(y = Total, x = Price.per.Unit)) +
geom_point(colour = "darkred") +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, color = "darkorange") +
ylab("Total Amount of money spent per order") +
xlab("Price per unit") +
theme_dark() +
geom_text(
label = mydata1$Price.per.Unit,
nudge_x = 0.50, nudge_y = 0.3,
check_overlap = T, size = 2.5) +
geom_hline(yintercept = mean(mydata1$Total), linetype = "dashed", colour = "blue") +
geom_vline(xintercept = mean(mydata1$Price.per.Unit), linetype = "dashed", colour = "blue")
This is just for better vizualisation, it is not one of the steps of the
regression analysis.
library(ggplot2)
ggplot(mydata1, aes(y = Total, x = Quantity, color = Gender)) +
geom_point(aes(shape = Gender)) +
scale_color_manual(values = c("Male" = "darkorange", "Female" = "darkred")) +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
ylab("Total amount") +
xlab("Quantity") +
theme_bw() +
geom_text(
label = mydata1$Transaction.ID,
nudge_x = 0.50, nudge_y = 0.3,
check_overlap = T, size = 2.5) +
labs(color = "Gender", shape = "Gender")
Here, I just wanted to present how much on average Females spend more in
a purchase than Males do (we capture this with a dummy variable)(only
quantity is given here, not Price per unit). This is just for better
vizualisation, it is not one of the steps of the regression
analysis.
Letβs continue the regression analysis:
fit1 <- lm(Total ~ Quantity + Price.per.Unit + Gender ,
data = mydata1)
summary(fit1)
##
## Call:
## lm(formula = Total ~ Quantity + Price.per.Unit + Gender, data = mydata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -839.7 -293.3 -120.5 243.4 1344.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -740.9989 51.7627 -14.31 <2e-16 ***
## Quantity 223.0683 16.8488 13.24 <2e-16 ***
## Price.per.Unit 3.7152 0.1054 35.26 <2e-16 ***
## GenderFemale 469.6378 37.6797 12.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 416.2 on 486 degrees of freedom
## Multiple R-squared: 0.7594, Adjusted R-squared: 0.7579
## F-statistic: 511.4 on 3 and 486 DF, p-value: < 2.2e-16
We have to use robust standard errors to get correct regression, since homoscedasticity is violated:
library(estimatr)
fit5 <- lm_robust(Total ~ Quantity + Price.per.Unit + Gender ,
se_type="HC1",
data = mydata1)
summary(fit5)
##
## Call:
## lm_robust(formula = Total ~ Quantity + Price.per.Unit + Gender,
## data = mydata1, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) -740.999 59.0754 -12.54 1.837e-31 -857.074 -624.924 486
## Quantity 223.068 17.7123 12.59 1.132e-31 188.266 257.870 486
## Price.per.Unit 3.715 0.1416 26.24 3.732e-95 3.437 3.993 486
## GenderFemale 469.638 38.3170 12.26 2.781e-30 394.350 544.925 486
##
## Multiple R-squared: 0.7594 , Adjusted R-squared: 0.7579
## F-statistic: 315 on 3 and 486 DF, p-value: < 2.2e-16
H0: Beta(i) = 0
H1: Beta(i) =/= 0 With all p values>0,001, we can reject the null hypothesis at statistical significance. We have enough statistical evidence that actual regression coefficient Beta differs from 0.
H1: π½1 =/= 0 We reject H0 at p>0,001. Quantity of product purchased per purchase has a significant effect on total value of the purchase. If quantity of products purchased in a purchase increases by one unit (product), the total value of the purchase on average increases by 223,07 monetary units at p>0,001, assuming all other explanatory variables remain the same.
H1: π½2 =/= 0 We reject H0 at p>0,001. Price per unit of the products purchased in the purchase has a significant effect on total value of the purchase. If price per unit of the products purchased increases by one monetary unit, the total value of the purchase on average increases by 3,72 monetary units at p>0,001, assuming all other explanatory variables remain the same.
H1: π½3 =/= 0 We reject H0 at p>0,001. Gender has a significant effect on total value of the purchase. Given values of all other explanatory variables, females will on average spend 469,64 monetary units more on a purchase than male respondends.
Coefficient of determination R - squared = 0.7594 75,94% of the variability of the total value of the pruchase is explained by the variability (or linear effect) of all other variables, included in this model (Quantity, price per unit and gender).
Test of Significance of regression F-statistics (ANOVA) H0: π2 = 0 or π½1 = π½2 = π½3 = 0
H1: π2 > 0 or at least one π½π is different from 0
We reject the null hypothesis at p<0.001 and conclude that at least some of the explanatory variables impact the total value of a purchase.
fit3 <- lm_robust(Total ~ Quantity + Price.per.Unit + Product.Category + Gender ,
se_type="HC1",
data = mydata1)
summary(fit3)
##
## Call:
## lm_robust(formula = Total ~ Quantity + Price.per.Unit + Product.Category +
## Gender, data = mydata1, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower
## (Intercept) -752.573 67.6124 -11.1307 8.867e-26 -885.423
## Quantity 223.074 17.7345 12.5785 1.362e-31 188.227
## Price.per.Unit 3.715 0.1417 26.2248 5.619e-95 3.437
## Product.CategoryElectronics 15.015 45.9296 0.3269 7.439e-01 -75.231
## Product.CategoryClothing 17.921 43.6075 0.4110 6.813e-01 -67.763
## GenderFemale 469.394 38.3479 12.2404 3.351e-30 394.045
## CI Upper DF
## (Intercept) -619.723 484
## Quantity 257.920 484
## Price.per.Unit 3.994 484
## Product.CategoryElectronics 105.261 484
## Product.CategoryClothing 103.604 484
## GenderFemale 544.743 484
##
## Multiple R-squared: 0.7595 , Adjusted R-squared: 0.757
## F-statistic: 193.7 on 5 and 484 DF, p-value: < 2.2e-16
Even though adding a variable of Product category, which is statistically significant, the model did not improve as itβs still explaining 76% of the data, so we do not include it in our regression model and go for the simpler one without it.
sqrt(summary(fit5)$r.squared)
## [1] 0.8714526
Multiple regression coefficient - the square root of the R squared tells us the relationship between all the variables in the model(simpler one without the category of the products added).
The relationship between the dependent variable total value of the purchase, and quantity, price per unit, and gender (explanatory) variables is strong.
We do not include sign here (positive/negative) since it has many dimensions (relationship between 4 variables).