RQ: Is there any difference in the total expenditure per transaction between men and women?
mydataa <- read.table("~/hw17/dataset.csv",
header = TRUE,
sep=",",
dec=".") #reading the dataset from the CSV file
head(mydataa)
## Transaction.ID Date Customer.ID Gender Age Product.Category Quantity
## 1 1 2023-11-24 CUST001 Male 34 Beauty 3
## 2 2 2023-02-27 CUST002 Female 26 Clothing 2
## 3 3 2023-01-13 CUST003 Male 50 Electronics 1
## 4 4 2023-05-21 CUST004 Male 37 Clothing 1
## 5 5 2023-05-06 CUST005 Male 30 Beauty 2
## 6 6 2023-04-25 CUST006 Female 45 Beauty 1
## Price.per.Unit Total.Amount
## 1 50 150
## 2 500 1000
## 3 30 30
## 4 500 500
## 5 50 100
## 6 30 30
Explanation of dataset: The data records retail transactions through the following variables: Transaction ID, Date, Customer ID, Gender, Age, Product Category, Quantity, Price per Unit, and Total Amount. These variables can be used to study sales trends, demographic influences, and purchasing behavior.
The data has 1000 observations of 9 variables (the ones listed above).
The dataset comes from the Kaggle website:
Ranitsarkar. (2023, March 4). Yulu_Analysis-Hypothesis testing. https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset
The unit of observation is a transaction.
The dataset is larger than needed (1000 observations) and contains variables that are irrelevant to the research question, so I draw a random sample and keep only the relevant variables. The cleaned data is explained below.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(1)
sampled_data <- mydataa %>% sample_n(size = 100, replace = FALSE) #drawing a simple random sample of 100 transactions without replacement
head(sampled_data)
## Transaction.ID Date Customer.ID Gender Age Product.Category Quantity
## 1 836 2023-04-19 CUST836 Female 22 Clothing 1
## 2 679 2023-01-11 CUST679 Female 18 Beauty 3
## 3 129 2023-04-23 CUST129 Female 21 Beauty 2
## 4 930 2023-05-10 CUST930 Male 54 Clothing 4
## 5 509 2023-06-26 CUST509 Female 37 Electronics 3
## 6 471 2023-03-23 CUST471 Male 32 Clothing 3
## Price.per.Unit Total.Amount
## 1 50 50
## 2 30 90
## 3 300 600
## 4 50 200
## 5 300 900
## 6 50 150
myrelevantdata <- sampled_data[, -c(2,3,5,6,7,8)] #removing the irrelevant variables
colnames(myrelevantdata) <- c("Transaction_ID", "Gender", "Total_Expenditure") #renaming the columns
myrelevantdata$Gender <- factor(myrelevantdata$Gender, levels = c("Male", "Female"), labels = c("Male", "Female")) #factoring gender variable
head(myrelevantdata)
## Transaction_ID Gender Total_Expenditure
## 1 836 Female 50
## 2 679 Female 90
## 3 129 Female 600
## 4 930 Male 200
## 5 509 Female 900
## 6 471 Male 150
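Dropping columns by position, as above, depends on the column order of the file. A minimal sketch of the same cleaning step that selects columns by name instead, using only base R and values copied from the head(mydataa) output above. The object names toy and toy_clean are illustrative and not part of the analysis:

```r
# First six rows of the raw data, copied from the head(mydataa) output above
# (only a subset of the columns is reproduced here)
toy <- data.frame(
  Transaction.ID = 1:6,
  Gender = c("Male", "Female", "Male", "Male", "Male", "Female"),
  Age = c(34, 26, 50, 37, 30, 45),
  Total.Amount = c(150, 1000, 30, 500, 100, 30)
)

# Keep the relevant columns by name rather than by position
toy_clean <- toy[, c("Transaction.ID", "Gender", "Total.Amount")]
colnames(toy_clean) <- c("Transaction_ID", "Gender", "Total_Expenditure")

# Same factor coding as in the analysis: Male is the first level
toy_clean$Gender <- factor(toy_clean$Gender, levels = c("Male", "Female"))

str(toy_clean)
```

Selecting by name gives the same three columns as the positional version, but keeps working if columns are added or reordered in the source file.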
Based on my RQ: The unit of observation is a transaction, and the variables are gender and total expenditure per transaction. H0: MeanF = MeanM. H1: MeanF != MeanM.
library(psych) #descriptive statistics for the male and female groups, to compare their arithmetic means
## Warning: package 'psych' was built under R version 4.3.2
describeBy(myrelevantdata$Total_Expenditure, g = myrelevantdata$Gender)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 48 406.46 509.99 110 322.25 118.61 30 2000 1970 1.39 0.81
## se
## X1 73.61
## ------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 52 359.42 483.73 100 263.1 88.96 25 2000 1975 1.62 1.71
## se
## X1 67.08
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(myrelevantdata, aes(x= Total_Expenditure)) +
geom_histogram(binwidth = 160, colour= "blue") +
facet_wrap(~Gender, ncol = 1) +
ylab("Frequency") #checking for normality and outliers. Both distributions are right-skewed with potential outliers; the boxplot at the end confirms the outliers, which supports using a non-parametric test
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.3.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
myrelevantdata %>%
group_by(Gender) %>%
shapiro_test(Total_Expenditure) #Shapiro-Wilk normality test per group, checking the normality assumption of the parametric test for independent samples
## # A tibble: 2 × 4
## Gender variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Male Total_Expenditure 0.732 0.0000000521
## 2 Female Total_Expenditure 0.701 0.00000000536
H0: Variable is normally distributed. H1: Variable is not normally distributed.
We reject H0 at p < 0.001 for both groups, “Male” and “Female”. The normality assumption is violated, so we need to use a non-parametric test.
t.test(myrelevantdata$Total_Expenditure ~ myrelevantdata$Gender, paired = FALSE, var.equal = FALSE, alternative = "two.sided") #parametric t-test for independent samples with the Welch correction, because the equal-variances assumption is often violated
##
## Welch Two Sample t-test
##
## data: myrelevantdata$Total_Expenditure by myrelevantdata$Gender
## t = 0.47228, df = 96.283, p-value = 0.6378
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
## -150.6440 244.7145
## sample estimates:
## mean in group Male mean in group Female
## 406.4583 359.4231
We cannot reject H0 at p = 0.64. This means we cannot reject that the true difference in means between Male and Female is equal to 0. However, since the normality assumption is violated, we need to perform a non-parametric test.
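As a check on the Welch output, the test statistic and degrees of freedom can be recomputed by hand from the rounded group summaries reported by describeBy above. This is only a sketch based on those rounded values, so the results agree with t.test up to rounding. The variable names are illustrative:

```r
# Group summaries taken from the describeBy output (rounded values)
n_m <- 48; mean_m <- 406.46; sd_m <- 509.99   # Male
n_f <- 52; mean_f <- 359.42; sd_f <- 483.73   # Female

# Squared standard errors of the two group means
v_m <- sd_m^2 / n_m
v_f <- sd_f^2 / n_f

# Welch t statistic: difference in means over its standard error
t_stat <- (mean_m - mean_f) / sqrt(v_m + v_f)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_m + v_f)^2 / (v_m^2 / (n_m - 1) + v_f^2 / (n_f - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 3)
```

The values should land close to the reported t = 0.47228, df = 96.283 and p-value = 0.6378.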
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(myrelevantdata$Total_Expenditure ~ myrelevantdata$Gender, pooled_sd = FALSE) #checking Cohen's d effect size with un-pooled SD, i.e. not assuming equal variances (consistent with the Welch t-test)
## Cohen's d | 95% CI
## -------------------------
## 0.09 | [-0.30, 0.49]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.09, rules = "sawilowsky2009")
## [1] "tiny"
## (Rules: sawilowsky2009)
The effect size is tiny, d = 0.09. However, since the normality assumption is violated, we need to perform a non-parametric test.
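The reported d can be reproduced from the group summaries. The sketch below assumes that the un-pooled SD is the square root of the average of the two group variances; with the rounded values from describeBy this assumption reproduces the reported 0.09. The variable names are illustrative:

```r
# Group summaries taken from the describeBy output (rounded values)
mean_m <- 406.46; sd_m <- 509.99   # Male
mean_f <- 359.42; sd_f <- 483.73   # Female

# Un-pooled SD (assumption: root of the mean of the two group variances,
# i.e. group sizes are not used as weights)
sd_unpooled <- sqrt((sd_m^2 + sd_f^2) / 2)

# Cohen's d: standardized difference in means (Male minus Female)
d <- (mean_m - mean_f) / sd_unpooled

round(d, 2)
```

A difference of about 47 in mean expenditure is less than a tenth of the typical within-group spread of roughly 500, which is why d is so close to zero.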
wilcox.test(myrelevantdata$Total_Expenditure ~ myrelevantdata$Gender,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided") #non-parametric test, used because the normality assumption is violated and there are outliers
##
## Wilcoxon rank sum test
##
## data: myrelevantdata$Total_Expenditure by myrelevantdata$Gender
## W = 1381, p-value = 0.3566
## alternative hypothesis: true location shift is not equal to 0
We cannot reject H0 at p = 0.36. This means we cannot reject that the true location shift between the Male and Female distributions of total expenditure is equal to 0.
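With exact = FALSE and correct = FALSE, the p-value is based on a normal approximation of W without continuity correction. A rough sketch from the reported W and the group sizes follows. It assumes there are no tied values, which does not hold for this data (expenditure amounts repeat across transactions), so it gives a p-value of roughly 0.36 that is close to, but not identical with, the reported 0.3566. The variable names are illustrative:

```r
# Reported test statistic and group sizes (Male is the first group)
W <- 1381
n_m <- 48
n_f <- 52

# Mean and standard deviation of W under H0, ignoring ties (assumption)
mu_W <- n_m * n_f / 2
sigma_W <- sqrt(n_m * n_f * (n_m + n_f + 1) / 12)

# Standardized statistic and two-sided p-value, no continuity correction
z <- (W - mu_W) / sigma_W
p_approx <- 2 * pnorm(-abs(z))

round(c(z = z, p = p_approx), 3)
```

Ties reduce the variance of W, which makes z slightly larger and the p-value slightly smaller; that is the direction of the gap to the reported value.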
effectsize(wilcox.test(myrelevantdata$Total_Expenditure ~ myrelevantdata$Gender,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")) #checking effect size for Wilcoxon Rank Sum Test.
## r (rank biserial) | 95% CI
## ---------------------------------
## 0.11 | [-0.12, 0.32]
interpret_rank_biserial(0.11)
## [1] "small"
## (Rules: funder2019)
The effect size is small, r = 0.11.
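The rank-biserial correlation can be recovered from the reported W and the group sizes. A minimal sketch, assuming the usual relation r = 2W / (n1 * n2) - 1, which reproduces the reported 0.11. The variable names are illustrative:

```r
# Reported Wilcoxon statistic and group sizes (Male is the first group)
W <- 1381
n_m <- 48
n_f <- 52

# W / (n_m * n_f) is the share of Male-Female pairs in which the Male
# transaction has the higher expenditure (ties counted as one half)
prop_male_higher <- W / (n_m * n_f)

# Rank-biserial correlation: rescale that share from [0, 1] to [-1, 1]
r_rb <- 2 * prop_male_higher - 1

round(c(share = prop_male_higher, r = r_rb), 2)
```

In roughly 55% of all Male-Female pairs the male transaction is the larger one, only slightly above the 50% expected under no difference.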
#install.packages("ggpubr")
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.3.2
ggboxplot(myrelevantdata,
x= "Gender",
y= "Total_Expenditure",
add = "jitter") #boxplot showing the outliers; together with the rejected normality this supports using a non-parametric test.
Conclusions: Based on the sample data, we cannot conclude that men and
women differ in total expenditure per transaction (Wilcoxon rank sum
test, W = 1381, p = 0.36). The effect size is small, r = 0.11.