Lana Rakovec

Homework 1

RQ: Is there any difference in the total expenditure per transaction between men and women?

mydataa <- read.table("~/hw17/dataset.csv",
                    header = TRUE,
                    sep=",",
                    dec=".") #reading the CSV file into a data frame
head(mydataa)
##   Transaction.ID       Date Customer.ID Gender Age Product.Category Quantity
## 1              1 2023-11-24     CUST001   Male  34           Beauty        3
## 2              2 2023-02-27     CUST002 Female  26         Clothing        2
## 3              3 2023-01-13     CUST003   Male  50      Electronics        1
## 4              4 2023-05-21     CUST004   Male  37         Clothing        1
## 5              5 2023-05-06     CUST005   Male  30           Beauty        2
## 6              6 2023-04-25     CUST006 Female  45           Beauty        1
##   Price.per.Unit Total.Amount
## 1             50          150
## 2            500         1000
## 3             30           30
## 4            500          500
## 5             50          100
## 6             30           30

Explanation of the dataset: The data contains recorded retail transactions with the following variables: Transaction ID, Date, Customer ID, Gender, Age, Product Category, Quantity, Price per Unit, and Total Amount. These variables can be used to study sales trends, demographic influences, and purchasing behavior.

The data has 1000 observations of 9 variables (the columns shown in the output above).

The dataset comes from the Kaggle website:

Ranitsarkar. (2023, March 4). Yulu_Analysis-Hypothesis testing. https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset

The unit of observation is a transaction (the cleaned dataset keeps this unit but retains only the variables relevant to the RQ).

The dataset is larger than needed (1000 observations) and contains variables that are irrelevant to the research question, so I draw a random sample of 100 transactions, remove the irrelevant variables, and describe the cleaned data below.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(1)
sampled_data <- mydataa %>% sample_n(size = 100, replace = FALSE) #drawing a simple random sample of 100 transactions without replacement
head(sampled_data)
##   Transaction.ID       Date Customer.ID Gender Age Product.Category Quantity
## 1            836 2023-04-19     CUST836 Female  22         Clothing        1
## 2            679 2023-01-11     CUST679 Female  18           Beauty        3
## 3            129 2023-04-23     CUST129 Female  21           Beauty        2
## 4            930 2023-05-10     CUST930   Male  54         Clothing        4
## 5            509 2023-06-26     CUST509 Female  37      Electronics        3
## 6            471 2023-03-23     CUST471   Male  32         Clothing        3
##   Price.per.Unit Total.Amount
## 1             50           50
## 2             30           90
## 3            300          600
## 4             50          200
## 5            300          900
## 6             50          150
myrelevantdata <- sampled_data[, -c(2,3,5,6,7,8)] #removing the irrelevant variables
colnames(myrelevantdata) <- c("Transaction_ID", "Gender", "Total_Expenditure") #renaming the columns
myrelevantdata$Gender <- factor(myrelevantdata$Gender, levels = c("Male", "Female"), labels = c("Male", "Female")) #converting the Gender variable to a factor
head(myrelevantdata)
##   Transaction_ID Gender Total_Expenditure
## 1            836 Female                50
## 2            679 Female                90
## 3            129 Female               600
## 4            930   Male               200
## 5            509 Female               900
## 6            471   Male               150
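The cleaning step above drops columns by position, which silently breaks if the column order ever changes. As a minimal sketch of the same idea with columns selected by name, the block below uses a small hypothetical data frame `toy` built from the first three rows printed by `head(mydataa)`; the names `toy` and `clean` are illustrative and not part of the original analysis.

```r
# Hypothetical toy data: the first three rows shown by head(mydataa).
toy <- data.frame(
  Transaction.ID = 1:3,
  Date = c("2023-11-24", "2023-02-27", "2023-01-13"),
  Customer.ID = c("CUST001", "CUST002", "CUST003"),
  Gender = c("Male", "Female", "Male"),
  Age = c(34, 26, 50),
  Product.Category = c("Beauty", "Clothing", "Electronics"),
  Quantity = c(3, 2, 1),
  Price.per.Unit = c(50, 500, 30),
  Total.Amount = c(150, 1000, 30)
)

# Keep only the variables relevant to the RQ, selected by name.
clean <- toy[, c("Transaction.ID", "Gender", "Total.Amount")]

# Rename the columns and convert Gender to a factor.
names(clean) <- c("Transaction_ID", "Gender", "Total_Expenditure")
clean$Gender <- factor(clean$Gender, levels = c("Male", "Female"))

str(clean)
```

Selecting by name makes the intent readable in the code itself and does not depend on the position of each column in the CSV file.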

Based on my RQ: the unit of observation is a transaction, and the variables are gender and total expenditure per transaction.

H0: MeanF = MeanM

H1: MeanF != MeanM

library(psych) #descriptive statistics by both female and male group. Checking the difference between both of their arithmetic means.
## Warning: package 'psych' was built under R version 4.3.2
describeBy(myrelevantdata$Total_Expenditure, g = myrelevantdata$Gender)
## 
##  Descriptive statistics by group 
## group: Male
##    vars  n   mean     sd median trimmed    mad min  max range skew kurtosis
## X1    1 48 406.46 509.99    110  322.25 118.61  30 2000  1970 1.39     0.81
##       se
## X1 73.61
## ------------------------------------------------------------ 
## group: Female
##    vars  n   mean     sd median trimmed   mad min  max range skew kurtosis
## X1    1 52 359.42 483.73    100   263.1 88.96  25 2000  1975 1.62     1.71
##       se
## X1 67.08
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(myrelevantdata, aes(x= Total_Expenditure)) +
  geom_histogram(binwidth = 160, colour= "blue") +
  facet_wrap(~Gender, ncol = 1) +
  ylab("Frequency") #checking for normality and outliers. Both distributions are strongly right-skewed with potential outliers; the boxplot at the end confirms the outliers, which supports using a non-parametric test

library(rstatix)
## Warning: package 'rstatix' was built under R version 4.3.2
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
myrelevantdata %>%
group_by(Gender) %>%
shapiro_test(Total_Expenditure)  #Shapiro-Wilk normality test per group, checking the normality assumption of the parametric t-test
## # A tibble: 2 × 4
##   Gender variable          statistic             p
##   <fct>  <chr>                 <dbl>         <dbl>
## 1 Male   Total_Expenditure     0.732 0.0000000521 
## 2 Female Total_Expenditure     0.701 0.00000000536

H0: Variable is normally distributed.

H1: Variable is not normally distributed.

We reject H0 at p < 0.001 for both groups, “Female” and “Male”. The normality assumption is violated, so we need to use a non-parametric test.

t.test(myrelevantdata$Total_Expenditure ~ myrelevantdata$Gender, paired = FALSE, var.equal = FALSE, alternative = "two.sided") #independent-samples t-test with Welch correction, which does not assume equal variances in the two groups (that assumption is often violated)
## 
##  Welch Two Sample t-test
## 
## data:  myrelevantdata$Total_Expenditure by myrelevantdata$Gender
## t = 0.47228, df = 96.283, p-value = 0.6378
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -150.6440  244.7145
## sample estimates:
##   mean in group Male mean in group Female 
##             406.4583             359.4231

We cannot reject H0 (p = 0.64). This means we cannot reject the hypothesis that the true difference in means between the Male and Female groups equals 0. However, since the normality assumption is violated, we need to perform a non-parametric test.
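As a cross-check, the Welch statistic, its degrees of freedom, the p-value, and the confidence interval can be approximately reproduced from the rounded group means and standard errors in the `describeBy()` output. This is a sketch that assumes the standard Welch-Satterthwaite formulas; the variable names are illustrative, and small deviations from the `t.test()` output are expected because the inputs are rounded.

```r
# Summary statistics copied from the describeBy() output (rounded values).
mean_m <- 406.46; se_m <- 73.61; n_m <- 48   # Male group
mean_f <- 359.42; se_f <- 67.08; n_f <- 52   # Female group

# Standard error of the difference in means (variances not pooled).
se_diff <- sqrt(se_m^2 + se_f^2)

# Welch t statistic.
t_stat <- (mean_m - mean_f) / se_diff

# Welch-Satterthwaite approximation of the degrees of freedom.
df_welch <- (se_m^2 + se_f^2)^2 /
  (se_m^4 / (n_m - 1) + se_f^4 / (n_f - 1))

# Two-sided p-value and 95% confidence interval for the difference.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- (mean_m - mean_f) + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

round(c(t = t_stat, df = df_welch, p = p_value), 3)
round(ci, 1)
```

The values should agree with the test output to about two decimal places (t of roughly 0.47 on roughly 96.3 degrees of freedom), and the interval contains 0, which is consistent with not rejecting H0.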

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cohens_d(myrelevantdata$Total_Expenditure ~ myrelevantdata$Gender, pooled_sd = FALSE) #checking Cohen's d effect size with un-pooled SD (equal variances not assumed)
## Cohen's d |        95% CI
## -------------------------
## 0.09      | [-0.30, 0.49]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(0.09, rules = "sawilowsky2009")
## [1] "tiny"
## (Rules: sawilowsky2009)

The effect size is tiny, with d = 0.09. However, since the normality assumption is violated, we need to perform a non-parametric test.
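The reported d can be approximately recovered from the group means and standard deviations in the `describeBy()` output. The sketch below assumes that the un-pooled standard deviation is the square root of the average of the two group variances; the variable names are illustrative.

```r
# Group means and standard deviations from the describeBy() output (rounded).
mean_m <- 406.46; sd_m <- 509.99   # Male group
mean_f <- 359.42; sd_f <- 483.73   # Female group

# Un-pooled SD: square root of the mean of the two group variances
# (assumed definition, not weighted by group size).
sd_unpooled <- sqrt((sd_m^2 + sd_f^2) / 2)

# Cohen's d: standardized difference in means.
d <- (mean_m - mean_f) / sd_unpooled

round(d, 2)
```

A difference of about 47 in expenditure against a spread of about 500 is why the standardized effect is so small.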

wilcox.test(myrelevantdata$Total_Expenditure ~ myrelevantdata$Gender,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided") #non-parametric Wilcoxon rank-sum test, used because the normality assumption is violated and there are outliers
## 
##  Wilcoxon rank sum test
## 
## data:  myrelevantdata$Total_Expenditure by myrelevantdata$Gender
## W = 1381, p-value = 0.3566
## alternative hypothesis: true location shift is not equal to 0

We cannot reject H0 (p = 0.36). This means we cannot reject the hypothesis that the true location shift between the Male and Female distributions of total expenditure equals 0.

effectsize(wilcox.test(myrelevantdata$Total_Expenditure ~ myrelevantdata$Gender,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")) #checking effect size for Wilcoxon Rank Sum Test.
## r (rank biserial) |        95% CI
## ---------------------------------
## 0.11              | [-0.12, 0.32]
interpret_rank_biserial(0.11)
## [1] "small"
## (Rules: funder2019)

The effect size is small, with r = 0.11.
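The rank-biserial correlation can also be derived directly from the W statistic and the group sizes. This is a sketch that assumes the common relation r = 2W / (n1 * n2) - 1, where W is the statistic R reports for the first factor level (Male); the variable names are illustrative.

```r
# Values taken from the wilcox.test() and describeBy() outputs.
W <- 1381     # Wilcoxon rank-sum statistic for the Male group
n_m <- 48     # number of Male transactions
n_f <- 52     # number of Female transactions

# W / (n_m * n_f) is the share of Male-Female pairs in which the Male
# transaction is larger (ties count as half); rescale it to [-1, 1].
r_rb <- 2 * W / (n_m * n_f) - 1

round(r_rb, 2)
```

A value near 0 means that a randomly chosen Male transaction is about as likely to exceed a randomly chosen Female transaction as the reverse.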

#install.packages("ggpubr")
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.3.2
ggboxplot(myrelevantdata,
          x= "Gender",
          y= "Total_Expenditure",
          add = "jitter") #boxplot showing the outliers in both groups, which confirms that the assumptions are violated and a non-parametric test is needed
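A boxplot typically flags points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to a small hypothetical vector `x` (made-up values, not the sample used above) to show how such points can be listed explicitly; it assumes the default quartile definition of `quantile()`.

```r
# Hypothetical expenditure values, for illustration only.
x <- c(25, 30, 50, 90, 100, 150, 200, 600, 900, 2000)

# Quartiles and interquartile range.
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences of the 1.5 * IQR rule.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers <- x[x < lower | x > upper]
outliers
```

Applied per gender group, the same rule gives a numeric list of the points that the boxplot draws beyond the whiskers.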

Conclusions: Based on the sample data, we cannot conclude that men and women differ in total expenditure per transaction (Wilcoxon rank-sum test, W = 1381, p = 0.36); the effect size is small (r = 0.11).