Homework 1

Metka Pintar

RQ: Does the total amount of a transaction differ between male and female customers?

1. Data

Source of the data: https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset/data

mydata <- read.table("./retail_sales_dataset.csv", header=TRUE, sep = ",", dec = ".") #Reading the data (comma-separated file, "." as decimal mark)
head(mydata) #Showing first 6 rows of the data

##   Transaction.ID       Date Customer.ID Gender Age Product.Category
## 1              1 2023-11-24     CUST001   Male  34           Beauty
## 2              2 2023-02-27     CUST002 Female  26         Clothing
## 3              3 2023-01-13     CUST003   Male  50      Electronics
## 4              4 2023-05-21     CUST004   Male  37         Clothing
## 5              5 2023-05-06     CUST005   Male  30           Beauty
## 6              6 2023-04-25     CUST006 Female  45           Beauty
##   Quantity Price.per.Unit Total.Amount
## 1        3             50          150
## 2        2            500         1000
## 3        1             30           30
## 4        1            500          500
## 5        2             50          100
## 6        1             30           30

Unit of observation: one customer transaction (place and time of data collection are not specified; synthetic data)

Sample size: 1000

Definition and units of variables:

Transaction.ID: A unique identifier for each transaction.
Date: The date when the transaction occurred (year, month, day).
Customer.ID: A unique identifier for each customer.
Gender: The gender of the customer (Male, Female).
Age: The age of the customer (years).
Product.Category: The category of the purchased product (Electronics, Clothing, Beauty).
Quantity: The number of units of the product purchased.
Price.per.Unit: The price of one unit of the product (monetary unit is not specified).
Total.Amount: The total monetary value of the transaction (monetary unit is not specified).
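
The definitions above suggest that Total.Amount should equal Quantity times Price.per.Unit. This is an assumption, since the data source is not quoted here as stating it, but it can be checked on the six rows printed by head(mydata). The data frame first_rows below is typed in by hand from that output and is only an illustration.

```r
# First six rows, copied by hand from the head(mydata) output above
first_rows <- data.frame(
  Quantity       = c(3, 2, 1, 1, 2, 1),
  Price.per.Unit = c(50, 500, 30, 500, 50, 30),
  Total.Amount   = c(150, 1000, 30, 500, 100, 30)
)

# Total expected under the assumed relationship: quantity * unit price
expected_total <- first_rows$Quantity * first_rows$Price.per.Unit

# TRUE only if every row agrees with the assumed relationship
all_consistent <- all(expected_total == first_rows$Total.Amount)
all_consistent
```

The same comparison could be run on the full mydata object to see whether the relationship holds for all 1000 rows.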

2. Analysis

summary(mydata) #Showing the descriptive statistics

##  Transaction.ID       Date           Customer.ID       
##  Min.   :   1.0   Length:1000        Length:1000       
##  1st Qu.: 250.8   Class :character   Class :character  
##  Median : 500.5   Mode  :character   Mode  :character  
##  Mean   : 500.5                                        
##  3rd Qu.: 750.2                                        
##  Max.   :1000.0                                        
##     Gender               Age        Product.Category  
##  Length:1000        Min.   :18.00   Length:1000       
##  Class :character   1st Qu.:29.00   Class :character  
##  Mode  :character   Median :42.00   Mode  :character  
##                     Mean   :41.39                     
##                     3rd Qu.:53.00                     
##                     Max.   :64.00                     
##     Quantity     Price.per.Unit   Total.Amount 
##  Min.   :1.000   Min.   : 25.0   Min.   :  25  
##  1st Qu.:1.000   1st Qu.: 30.0   1st Qu.:  60  
##  Median :3.000   Median : 50.0   Median : 135  
##  Mean   :2.514   Mean   :179.9   Mean   : 456  
##  3rd Qu.:4.000   3rd Qu.:300.0   3rd Qu.: 900  
##  Max.   :4.000   Max.   :500.0   Max.   :2000

Explanation of the descriptive statistics:

Max Age = 64; The oldest customer was 64 years old.
Median Quantity = 3; In half of the transactions customers purchased 3 or fewer units, in the other half they purchased 3 or more units.
Mean Total.Amount = 456; The average total monetary value of a transaction is 456 monetary units.
3rd quartile Price.per.Unit = 300; In 75% of the transactions the price per unit is at most 300 monetary units, in the remaining 25% it is higher than 300 monetary units.

mydata2 <- mydata[, c(4,9)] #Including only 4th and 9th column (only needed data)
head(mydata2)

##   Gender Total.Amount
## 1   Male          150
## 2 Female         1000
## 3   Male           30
## 4   Male          500
## 5   Male          100
## 6 Female           30

set.seed(1) #Setting initial point of sampling
mydata2 <- mydata2[sample(nrow(mydata2), 100, replace = TRUE),] #Random sample of 100 units (drawn with replacement)
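
Because the sample is drawn with replace = TRUE, the same row can be selected more than once. The sketch below only illustrates this on row indices and does not need the data file. The value n_rows = 1000 is taken from the sample size reported above, and the object names are made up for the illustration.

```r
set.seed(1)      # same seed as in the analysis
n_rows <- 1000   # sample size reported above

# Row indices drawn with replacement (as in the analysis) and without replacement
idx_with    <- sample(n_rows, 100, replace = TRUE)
idx_without <- sample(n_rows, 100, replace = FALSE)

# Number of draws that repeat an already selected row
n_repeated_with    <- sum(duplicated(idx_with))
n_repeated_without <- sum(duplicated(idx_without))  # always 0 by construction

c(with_replacement = n_repeated_with, without_replacement = n_repeated_without)
```

If repeated rows are not wanted, replace = FALSE would be the usual choice. The results reported in this homework are based on the sample drawn with replacement.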

library(psych)
describeBy(mydata2$Total.Amount, g = mydata2$Gender)#Showing descriptive statistics separately for males and females

## 
##  Descriptive statistics by group 
## group: Female
##    vars  n   mean     sd median trimmed    mad min  max range skew
## X1    1 53 348.68 471.83    100  252.09 103.78  25 2000  1975 1.75
##    kurtosis    se
## X1     2.23 64.81
## ---------------------------------------------------- 
## group: Male
##    vars  n   mean     sd median trimmed    mad min  max range skew
## X1    1 47 424.79 527.89    100  342.18 103.78  30 2000  1970 1.25
##    kurtosis se
## X1     0.32 77

Positive skewness (1.75 for females, 1.25 for males) indicates that both distributions are asymmetrical to the right.
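
To show what a positive skewness value means, here is a small sketch with one common moment-based definition (third central moment divided by the cubed standard deviation). The helper skewness_b1 is written only for this illustration. The psych package may use a slightly different variant, so small numerical differences from describeBy() would not be surprising.

```r
# Moment-based skewness: third central moment divided by sd^3
skewness_b1 <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)  # third central moment
  m3 / sd(x)^3
}

# Six Total.Amount values from head(mydata): a few large values pull the tail right
right_skewed <- c(30, 30, 100, 150, 500, 1000)
# A perfectly symmetric set of values for comparison
symmetric <- c(1, 2, 3, 4, 5)

skew_right <- skewness_b1(right_skewed)  # positive: long right tail
skew_sym   <- skewness_b1(symmetric)     # zero: no asymmetry
c(right_skewed = skew_right, symmetric = skew_sym)
```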

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata2, aes(x = Total.Amount)) +
  geom_histogram(binwidth = 30, colour = "royalblue2", fill = "steelblue3") +
  facet_wrap(~Gender, ncol = 1) +
  ylab("Frequency") #Graphical presentation of the total amount transaction for males and females

The histograms also clearly show that both distributions are asymmetrical to the right.

Testing the hypothesis

Parametric test: Welch’s t-test

t.test(mydata2$Total.Amount ~ mydata2$Gender,
       paired = FALSE,
       var.equal = FALSE,
       alternative = "two.sided") #Independent t.test with Welch correction

## 
##  Welch Two Sample t-test
## 
## data:  mydata2$Total.Amount by mydata2$Gender
## t = -0.7562, df = 92.982, p-value = 0.4514
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -275.9706  123.7546
## sample estimates:
## mean in group Female   mean in group Male 
##             348.6792             424.7872

H0: mu_male - mu_female = 0
H1: mu_male - mu_female ≠ 0

We cannot reject H0 (p value = 0.4514 > 0.05). Based on the sample, we cannot say that the mean total transaction amount differs between males and females.
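
The Welch statistics above can be reproduced by hand from the group sizes, means and standard deviations in the describeBy() output. The sketch below is only a check of the arithmetic. It uses the rounded summary values, so the last decimals can differ slightly from the t.test() output.

```r
# Summary statistics copied from the describeBy() output above
n_f <- 53; mean_f <- 348.68; sd_f <- 471.83   # Female
n_m <- 47; mean_m <- 424.79; sd_m <- 527.89   # Male

# Squared standard errors of the two group means
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Welch t statistic (Female minus Male, as in the t.test() output)
se_diff <- sqrt(v_f + v_m)
t_stat  <- (mean_f - mean_m) / se_diff

# Welch-Satterthwaite degrees of freedom
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p value and 95% confidence interval for the difference in means
p_value <- 2 * pt(-abs(t_stat), df_welch)
ci <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df_welch) * se_diff

round(c(t = t_stat, df = df_welch, p = p_value), 4)
round(ci, 2)
```

The values agree with the t.test() output (t = -0.7562, df = 92.982, p value = 0.4514) up to rounding.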

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata2 %>%
  group_by(Gender) %>%
  shapiro_test(Total.Amount) #Checking whether variable is normally distributed

## # A tibble: 2 × 4
##   Gender variable     statistic             p
##   <chr>  <chr>            <dbl>         <dbl>
## 1 Female Total.Amount     0.692 0.00000000300
## 2 Male   Total.Amount     0.740 0.0000000892

H0: Variable is normally distributed.
H1: Variable is not normally distributed.

We reject H0 at p < 0.001 for both genders and accept H1: the total amount of a transaction is not normally distributed.
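
The same check by group can also be written in base R, without rstatix. The sketch below uses simulated right-skewed data, not the homework data. The data frame sim, the seed and the exponential distribution are all assumptions made only for the illustration.

```r
set.seed(123)  # arbitrary seed, simulated data only

# Simulated right-skewed amounts for two groups (not the real data)
sim <- data.frame(
  Gender       = rep(c("Female", "Male"), each = 50),
  Total.Amount = rexp(100, rate = 1 / 400)
)

# Shapiro-Wilk p value computed separately for each gender
sw_p <- tapply(sim$Total.Amount, sim$Gender,
               function(x) shapiro.test(x)$p.value)
sw_p
```

A small p value in a group would again lead to rejecting the hypothesis of normality for that group.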

Checking the assumptions

1. Numerical variable -> YES (total amount of a transaction is measured in monetary units)
2. Normal distribution -> NO (not normally distributed based on the Shapiro-Wilk normality test)
3. Variances of the total amount are the same for both genders -> NO (not assumed) -> Welch correction

Since the assumption of normality is not met, the parametric test is not appropriate. A non-parametric test is more appropriate, although non-parametric tests generally have lower statistical power than parametric tests when the parametric assumptions hold.
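
The third assumption was not tested formally above. As a rough sketch, the ratio of the two sample variances can be computed from the standard deviations in the describeBy() output. The F-based p value below assumes normality, which was rejected above, so it should be read with caution and not as a definitive check.

```r
# Standard deviations and group sizes from the describeBy() output above
sd_f <- 471.83; n_f <- 53   # Female
sd_m <- 527.89; n_m <- 47   # Male

# Ratio of sample variances (Male / Female); 1 would mean equal variances
f_ratio <- sd_m^2 / sd_f^2

# Two-sided p value of the classical F test (relies on normality)
p_lower <- pf(f_ratio, n_m - 1, n_f - 1)
p_f <- 2 * min(p_lower, 1 - p_lower)

round(c(variance_ratio = f_ratio, p_value = p_f), 4)
```

With the Welch correction the t-test does not rely on equal variances, so this check does not change the analysis.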

Non-parametric test: Wilcoxon Rank Sum Test

wilcox.test(mydata2$Total.Amount ~ mydata2$Gender,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided") #Applying the alternative non-parametric test

## 
##  Wilcoxon rank sum test
## 
## data:  mydata2$Total.Amount by mydata2$Gender
## W = 1147, p-value = 0.4944
## alternative hypothesis: true location shift is not equal to 0

H0: The distribution location of the total amount is the same for males and females.
H1: The distribution location of the total amount is not the same for males and females.

We cannot reject H0 (p value = 0.4944 > 0.05). Based on the sample, we cannot say that the distribution location of the total amount differs between males and females.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize(wilcox.test(mydata2$Total.Amount ~ mydata2$Gender,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided")) #Calculating the effect size

## r (rank biserial) |        95% CI
## ---------------------------------
## -0.08             | [-0.30, 0.15]
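
The rank-biserial correlation can be related to the W statistic of the Wilcoxon rank sum test. A minimal sketch, assuming the usual relation r = 2W / (n1 * n2) - 1, where W is the statistic for the first group (Female) and n1, n2 are the group sizes from the describeBy() output:

```r
# W from the wilcox.test() output and group sizes from describeBy()
w_stat <- 1147
n_f <- 53   # Female (first group)
n_m <- 47   # Male

# Rank-biserial correlation from W (assumed relation, see text above)
r_rb <- 2 * w_stat / (n_f * n_m) - 1
round(r_rb, 2)
```

The result rounds to -0.08, which agrees with the effectsize() output. The negative sign means that females tend to have somewhat lower total amounts than males in this sample.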

interpret_rank_biserial(0.08) #Interpreting the effect size (absolute value of r from above)

## [1] "very small"
## (Rules: funder2019)

3. Conclusion

Using the sample data, we are unable to say that the total amount of a transaction differs between male and female customers (Wilcoxon rank sum test, p value = 0.4944 > 0.05). The effect size is very small (rank-biserial r = -0.08, 95% CI [-0.30, 0.15]); the difference is not statistically significant.