mydata <- read.table("./retail_sales_dataset.csv", header = TRUE, sep = ",", dec = ".") #Reading the data (comma-separated fields, decimal point)
head(mydata) #Showing first 6 rows of the data
## Transaction.ID Date Customer.ID Gender Age Product.Category
## 1 1 2023-11-24 CUST001 Male 34 Beauty
## 2 2 2023-02-27 CUST002 Female 26 Clothing
## 3 3 2023-01-13 CUST003 Male 50 Electronics
## 4 4 2023-05-21 CUST004 Male 37 Clothing
## 5 5 2023-05-06 CUST005 Male 30 Beauty
## 6 6 2023-04-25 CUST006 Female 45 Beauty
## Quantity Price.per.Unit Total.Amount
## 1 3 50 150
## 2 2 500 1000
## 3 1 30 30
## 4 1 500 500
## 5 2 50 100
## 6 1 30 30
Unit of observation: one customer (place and time of data collection are not specified; the data are synthetic)
Sample size: 1000
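One way to check that each row really corresponds to one customer is to confirm that the ID columns contain no duplicates. The sketch below uses a small hypothetical data frame `toy`, built from the six rows printed by `head(mydata)`, rather than the full file; on the real data the same checks would be run on `mydata`.

```r
# Toy data: ID columns of the six rows printed by head(mydata)
toy <- data.frame(
  Transaction.ID = 1:6,
  Customer.ID    = c("CUST001", "CUST002", "CUST003",
                     "CUST004", "CUST005", "CUST006")
)

# anyDuplicated() returns 0 when no value is repeated
ids_unique <- anyDuplicated(toy$Transaction.ID) == 0 &&
  anyDuplicated(toy$Customer.ID) == 0

n_rows      <- nrow(toy)                        # number of observations
n_customers <- length(unique(toy$Customer.ID))  # number of distinct customers

print(ids_unique)
print(n_rows == n_customers)
```

If the number of rows equals the number of distinct customer IDs, every customer appears exactly once and the sample size equals the number of customers.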
summary(mydata) #Showing the descriptive statistics
## Transaction.ID Date Customer.ID
## Min. : 1.0 Length:1000 Length:1000
## 1st Qu.: 250.8 Class :character Class :character
## Median : 500.5 Mode :character Mode :character
## Mean : 500.5
## 3rd Qu.: 750.2
## Max. :1000.0
## Gender Age Product.Category
## Length:1000 Min. :18.00 Length:1000
## Class :character 1st Qu.:29.00 Class :character
## Mode :character Median :42.00 Mode :character
## Mean :41.39
## 3rd Qu.:53.00
## Max. :64.00
## Quantity Price.per.Unit Total.Amount
## Min. :1.000 Min. : 25.0 Min. : 25
## 1st Qu.:1.000 1st Qu.: 30.0 1st Qu.: 60
## Median :3.000 Median : 50.0 Median : 135
## Mean :2.514 Mean :179.9 Mean : 456
## 3rd Qu.:4.000 3rd Qu.:300.0 3rd Qu.: 900
## Max. :4.000 Max. :500.0 Max. :2000
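Explanation of the descriptive statistics:
Age: customers are on average 41.39 years old; half of them are 42 or younger, and ages range from 18 to 64.
Quantity: half of the transactions contain 3 or fewer units; no transaction contains more than 4.
Total.Amount: the mean (456) is far above the median (135), which already suggests a right-skewed distribution; amounts range from 25 to 2000 (currency not specified).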
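Because `Date`, `Gender` and `Product.Category` are stored as character, `summary()` reports only their length. A minimal sketch, assuming the columns are converted before summarising, is shown below on a hypothetical data frame `toy` that reuses the six rows printed by `head(mydata)`.

```r
# Toy data: three columns of the six rows printed by head(mydata)
toy <- data.frame(
  Date   = c("2023-11-24", "2023-02-27", "2023-01-13",
             "2023-05-21", "2023-05-06", "2023-04-25"),
  Gender = c("Male", "Female", "Male", "Male", "Male", "Female"),
  Product.Category = c("Beauty", "Clothing", "Electronics",
                       "Clothing", "Beauty", "Beauty"),
  stringsAsFactors = FALSE
)

# Convert text columns to types that summary() can describe
toy$Date             <- as.Date(toy$Date, format = "%Y-%m-%d")
toy$Gender           <- factor(toy$Gender)
toy$Product.Category <- factor(toy$Product.Category)

gender_counts <- table(toy$Gender)  # frequency of each gender
print(summary(toy))                 # now shows date range and category counts
```

With factors, `summary()` prints the number of observations per category instead of `Length`/`Class`/`Mode`.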
mydata2 <- mydata[, c(4,9)] #Including only 4th and 9th column (only needed data)
head(mydata2)
## Gender Total.Amount
## 1 Male 150
## 2 Female 1000
## 3 Male 30
## 4 Male 500
## 5 Male 100
## 6 Female 30
set.seed(1) #Setting initial point of sampling
mydata2 <- mydata2[sample(nrow(mydata2), 100, replace = TRUE),] #Random sample of 100 units
library(psych)
describeBy(mydata2$Total.Amount, g = mydata2$Gender)#Showing descriptive statistics separately for males and females
##
## Descriptive statistics by group
## group: Female
## vars n mean sd median trimmed mad min max range skew
## X1 1 53 348.68 471.83 100 252.09 103.78 25 2000 1975 1.75
## kurtosis se
## X1 2.23 64.81
## ----------------------------------------------------
## group: Male
## vars n mean sd median trimmed mad min max range skew
## X1 1 47 424.79 527.89 100 342.18 103.78 30 2000 1970 1.25
## kurtosis se
## X1 0.32 77
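Positive skewness (1.75 for females, 1.25 for males) indicates that both distributions are asymmetrical to the right: most transactions are small, with a long tail of large amounts. In both groups the mean (348.68 and 424.79) is well above the median (100).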
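To make the skewness figures concrete, here is a minimal base-R sketch of a moment-based skewness coefficient. The helper `skew_moment()` is hypothetical and the toy vectors are made up; `psych::describeBy()` may apply a slightly different small-sample formula, so values need not match its output exactly.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
skew_moment <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^1.5
}

right_tailed <- c(25, 30, 30, 50, 50, 100, 150, 500, 1000, 2000)  # made-up amounts
symmetric    <- c(-2, -1, 0, 1, 2)                                # made-up values

skew_right <- skew_moment(right_tailed)  # positive: long right tail
skew_sym   <- skew_moment(symmetric)     # zero: symmetric around the mean
print(c(right = skew_right, symmetric = skew_sym))
```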
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata2, aes(x = Total.Amount)) +
  geom_histogram(binwidth = 30, colour = "royalblue2", fill = "steelblue3") +
  facet_wrap(~Gender, ncol = 1) +
  ylab("Frequency") #Histograms of the total transaction amount for males and females
The histograms confirm that both distributions are asymmetrical to the right.
Parametric test: Welch’s t-test
t.test(mydata2$Total.Amount ~ mydata2$Gender,
       var.equal = FALSE,
       alternative = "two.sided") #Independent-samples t-test with Welch correction
##
## Welch Two Sample t-test
##
## data: mydata2$Total.Amount by mydata2$Gender
## t = -0.7562, df = 92.982, p-value = 0.4514
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -275.9706 123.7546
## sample estimates:
## mean in group Female mean in group Male
## 348.6792 424.7872
We cannot reject H0 (p = 0.4514 > 0.05). Based on the sample, we cannot claim that the mean total transaction amount differs between males and females.
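The reported statistic can be reproduced by hand from the group summaries. The sketch below plugs the rounded means, standard deviations and group sizes from the `describeBy()` and `t.test()` output into the Welch formulas; the variable names are illustrative, and small differences from the printed output are due to rounding.

```r
# Group summaries taken from the output above
n_f <- 53; mean_f <- 348.6792; sd_f <- 471.83   # Female
n_m <- 47; mean_m <- 424.7872; sd_m <- 527.89   # Male

# Per-group variance of the mean and the standard error of the difference
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m
se  <- sqrt(v_f + v_m)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean_f - mean_m) / se
df     <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval for the mean difference
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df) * se

print(round(c(t = t_stat, df = df, p = p_value), 4))
print(round(ci, 2))
```

The degrees of freedom are not an integer because the Welch correction weights the two group variances instead of pooling them.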
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata2 %>%
  group_by(Gender) %>%
  shapiro_test(Total.Amount) #Checking whether variable is normally distributed
## # A tibble: 2 × 4
## Gender variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 Female Total.Amount 0.692 0.00000000300
## 2 Male Total.Amount 0.740 0.0000000892
We reject H0 at p < 0.001 in both groups and accept H1: the total amount of a transaction is not normally distributed for either females or males.
Since the normality assumption is not met, a parametric test is not appropriate. A non-parametric test is more suitable, although it generally has lower statistical power than a parametric test whose assumptions are satisfied.
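The behaviour of the Shapiro-Wilk test can be illustrated with base R's `shapiro.test()`, which `rstatix::shapiro_test()` wraps. The two vectors below are artificial (evenly spaced quantiles of an exponential and of a normal distribution), chosen only to contrast a right-skewed shape with a bell-shaped one; they are not taken from the data set.

```r
# Evenly spaced probabilities, so the example is deterministic (no random draws)
p_grid <- ppoints(100)

skewed_x <- qexp(p_grid)   # right-skewed shape, like the transaction amounts
normal_x <- qnorm(p_grid)  # bell-shaped reference

sw_skewed <- shapiro.test(skewed_x)
sw_normal <- shapiro.test(normal_x)

# A small p-value is evidence against normality
print(c(skewed = sw_skewed$p.value, normal = sw_normal$p.value))
```

The W statistic is close to 1 for data that look normal and drops as the shape departs from normality, which is the pattern seen in the values 0.692 and 0.740 reported for the two groups.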
Non-parametric test: Wilcoxon Rank Sum Test
wilcox.test(mydata2$Total.Amount ~ mydata2$Gender,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided") #Applying the alternative non-parametric test
##
## Wilcoxon rank sum test
##
## data: mydata2$Total.Amount by mydata2$Gender
## W = 1147, p-value = 0.4944
## alternative hypothesis: true location shift is not equal to 0
We cannot reject H0 (p = 0.4944 > 0.05). Based on the sample, we cannot claim that the location of the distribution of total amount differs between males and females.
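The W statistic reported by `wilcox.test()` is the rank sum of the first group minus its smallest possible value. A minimal sketch on made-up amounts (the vectors `x` and `y` are hypothetical, not taken from the data set):

```r
x <- c(150, 30, 500, 100, 25)   # hypothetical group 1
y <- c(1000, 30, 60, 900)       # hypothetical group 2

# Rank all observations together; tied values receive average ranks
r  <- rank(c(x, y))
n1 <- length(x)

# Rank sum of group 1 minus its minimum possible value n1 * (n1 + 1) / 2
w_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Same statistic from the built-in test (normal approximation, no continuity correction)
w_builtin <- unname(wilcox.test(x, y, exact = FALSE, correct = FALSE)$statistic)

print(c(manual = w_manual, builtin = w_builtin))
```

Because only ranks enter the statistic, a few very large amounts influence the result far less than they influence a comparison of means.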
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize(wilcox.test(mydata2$Total.Amount ~ mydata2$Gender,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided")) #Calculating the effect size
## r (rank biserial) | 95% CI
## ---------------------------------
## -0.08 | [-0.30, 0.15]
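For two independent groups, the rank-biserial correlation can be recovered from W and the group sizes. The sketch below uses W = 1147 from the Wilcoxon output and the group sizes 53 (female) and 47 (male) from `describeBy()`; it assumes the sign convention in which a negative value means the first group (Female) tends to have lower values, which matches the estimate printed above.

```r
W   <- 1147   # Wilcoxon rank sum statistic for the first group (Female)
n_f <- 53     # number of females in the sample
n_m <- 47     # number of males in the sample

# Share of (female, male) pairs in which the female amount is larger,
# rescaled from [0, 1] to [-1, 1]
r_rb <- 2 * W / (n_f * n_m) - 1

print(round(r_rb, 2))
```

A value near zero means that a randomly chosen female transaction is about as likely to exceed a randomly chosen male transaction as the other way round.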
interpret_rank_biserial(0.08) #Interpreting the absolute value of the effect size
## [1] "very small"
## (Rules: funder2019)
Using the sample data, we cannot claim that the total transaction amount differs between male and female customers (p = 0.4944 > 0.05). The effect size is very small (r = -0.08, 95% CI [-0.30, 0.15]), and the difference is not statistically significant.