RakovecLana_homework

Homework 1

RQ: Is there any difference in the total expenditure per transaction between men and women?

mydataa <- read.table("~/hw17/dataset.csv",
                    header = TRUE,
                    sep=",",
                    dec=".") #Creating the dataset
head(mydataa)

##   Transaction.ID       Date Customer.ID Gender Age Product.Category Quantity
## 1              1 2023-11-24     CUST001   Male  34           Beauty        3
## 2              2 2023-02-27     CUST002 Female  26         Clothing        2
## 3              3 2023-01-13     CUST003   Male  50      Electronics        1
## 4              4 2023-05-21     CUST004   Male  37         Clothing        1
## 5              5 2023-05-06     CUST005   Male  30           Beauty        2
## 6              6 2023-04-25     CUST006 Female  45           Beauty        1
##   Price.per.Unit Total.Amount
## 1             50          150
## 2            500         1000
## 3             30           30
## 4            500          500
## 5             50          100
## 6             30           30

Explanation of dataset: This dataset contains retail transaction records covering the main drivers of retail operations: Transaction ID, Date, Customer ID, Gender, Age, Product Category, Quantity, Price per Unit, and Total Amount. These variables help in analyzing sales trends, demographic influences, and purchasing behavior.

The data has 1000 observations and 9 variables, where:

Transaction ID is a unique identifier for each transaction, allowing tracking and reference,
Date is the date when the transaction occurred, providing insights into sales trends over time,
Customer ID is a unique identifier for each customer, enabling customer-centric analysis,
Gender is the gender of the customer (Male/Female), offering insights into gender-based purchasing patterns,
Age is the age of the customer, measured in years, facilitating segmentation and exploration of age-related influences,
Product Category is the category of the purchased product (e.g., Electronics, Clothing, Beauty), helping understand product preferences,
Quantity is the number of units of the product purchased, contributing to insights on purchase volumes,
Price per Unit is the price of one unit of the product, measured in monetary units, used in calculations of total spending,
Total Amount is the total monetary value of the transaction, measured in monetary units, showing the financial impact of each purchase.
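The variable list can be checked directly against the data. A minimal sketch, assuming a small illustrative data frame (first_rows is a hypothetical name) that copies the first three rows printed by head(mydataa); it counts the columns and checks that Total.Amount equals Quantity times Price.per.Unit:

```r
# Illustrative copy of the first three rows shown by head(mydataa);
# the real analysis would run the same checks on mydataa itself.
first_rows <- data.frame(
  Transaction.ID   = 1:3,
  Date             = c("2023-11-24", "2023-02-27", "2023-01-13"),
  Customer.ID      = c("CUST001", "CUST002", "CUST003"),
  Gender           = c("Male", "Female", "Male"),
  Age              = c(34, 26, 50),
  Product.Category = c("Beauty", "Clothing", "Electronics"),
  Quantity         = c(3, 2, 1),
  Price.per.Unit   = c(50, 500, 30),
  Total.Amount     = c(150, 1000, 30)
)

n_vars <- ncol(first_rows)   # number of variables (columns)
names(first_rows)            # variable names as read by R

# Total.Amount should equal Quantity * Price.per.Unit in every row
amount_consistent <- all(
  first_rows$Total.Amount == first_rows$Quantity * first_rows$Price.per.Unit
)
n_vars
amount_consistent
```

On the full dataset, dim(mydataa) would report the number of observations and variables in one call.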

The dataset comes from Kaggle:

Ranitsarkar. (2023, March 4). Yulu_Analysis-Hypothesis testing. https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset

The unit of observation is a transaction (it stays the same in the cleaned dataset used for the RQ).

The dataset is larger than needed (1000 observations) and contains variables that are irrelevant to the research question, so I draw a random sample and keep only the relevant variables. The cleaned data is explained below.

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

set.seed(1)
sampled_data <- mydataa %>% sample_n(size = 100, replace = FALSE) # simple random sample of 100 transactions, drawn without replacement

head(sampled_data)

##   Transaction.ID       Date Customer.ID Gender Age Product.Category Quantity
## 1            836 2023-04-19     CUST836 Female  22         Clothing        1
## 2            679 2023-01-11     CUST679 Female  18           Beauty        3
## 3            129 2023-04-23     CUST129 Female  21           Beauty        2
## 4            930 2023-05-10     CUST930   Male  54         Clothing        4
## 5            509 2023-06-26     CUST509 Female  37      Electronics        3
## 6            471 2023-03-23     CUST471   Male  32         Clothing        3
##   Price.per.Unit Total.Amount
## 1             50           50
## 2             30           90
## 3            300          600
## 4             50          200
## 5            300          900
## 6             50          150

myrelevantdata <- sampled_data[, c("Transaction.ID", "Gender", "Total.Amount")] # keep only the variables relevant to the RQ
colnames(myrelevantdata) <- c("Transaction_ID", "Gender", "Total_Expenditure") # rename the columns
myrelevantdata$Gender <- factor(myrelevantdata$Gender, levels = c("Male", "Female")) # convert gender to a factor
head(myrelevantdata)

##   Transaction_ID Gender Total_Expenditure
## 1            836 Female                50
## 2            679 Female                90
## 3            129 Female               600
## 4            930   Male               200
## 5            509 Female               900
## 6            471   Male               150

Based on my RQ, the unit of observation is a transaction, and the variables are gender and total expenditure per transaction. H0: Mean(Male) = Mean(Female). H1: Mean(Male) != Mean(Female).

library(psych) #descriptive statistics by both female and male group. Checking the difference between both of their arithmetic means.

## Warning: package 'psych' was built under R version 4.3.2

describeBy(myrelevantdata$Total_Expenditure, g = myrelevantdata$Gender)

## 
##  Descriptive statistics by group 
## group: Male
##    vars  n   mean     sd median trimmed    mad min  max range skew kurtosis
## X1    1 48 406.46 509.99    110  322.25 118.61  30 2000  1970 1.39     0.81
##       se
## X1 73.61
## ------------------------------------------------------------ 
## group: Female
##    vars  n   mean     sd median trimmed   mad min  max range skew kurtosis
## X1    1 52 359.42 483.73    100   263.1 88.96  25 2000  1975 1.62     1.71
##       se
## X1 67.08

The mean total expenditure is 406.46 monetary units for males and 359.42 monetary units for females.
The sample contains 48 male and 52 female transactions.
For males, the minimum expenditure is 30 monetary units and the maximum is 2000 monetary units.
For females, the minimum expenditure is 25 monetary units and the maximum is 2000 monetary units.
The distribution of male total expenditure is skewed to the right (skewness 1.39) and is leptokurtic, because the excess kurtosis reported by describeBy is positive (0.81).
The distribution of female total expenditure is also skewed to the right (skewness 1.62) and leptokurtic (excess kurtosis 1.71).

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

# Histograms by gender to inspect normality and outliers. Potential outliers are visible;
# the boxplot at the end of the report is used to check them.
ggplot(myrelevantdata, aes(x = Total_Expenditure)) +
  geom_histogram(binwidth = 160, colour = "blue") +
  facet_wrap(~Gender, ncol = 1) +
  ylab("Frequency")

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.3.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

myrelevantdata %>%
  group_by(Gender) %>%
  shapiro_test(Total_Expenditure) # Shapiro-Wilk normality test within each gender group

## # A tibble: 2 × 4
##   Gender variable          statistic             p
##   <fct>  <chr>                 <dbl>         <dbl>
## 1 Male   Total_Expenditure     0.732 0.0000000521 
## 2 Female Total_Expenditure     0.701 0.00000000536

H0: The variable is normally distributed. H1: The variable is not normally distributed.

We reject H0 at p < 0.001 in both groups (Male and Female). The normality assumption is violated, so a non-parametric test is needed.
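The same per-group normality check can also be run with base R alone. A minimal sketch, assuming a small illustrative data frame (toy is a hypothetical name) built from the twelve rows printed by the two head() calls above; it is not the full sample, so its p-values will differ from the table:

```r
# Illustrative data: the twelve transactions visible in the head() outputs above
toy <- data.frame(
  Gender = factor(c(rep("Male", 6), rep("Female", 6)),
                  levels = c("Male", "Female")),
  Total_Expenditure = c(150, 30, 500, 100, 200, 150,   # Male
                        1000, 30, 50, 90, 600, 900)    # Female
)

# Run shapiro.test() separately within each gender and keep the p-values
p_by_group <- tapply(
  toy$Total_Expenditure,
  toy$Gender,
  function(v) shapiro.test(v)$p.value
)

p_by_group          # one p-value per group
p_by_group < 0.05   # TRUE means normality is rejected at the 5% level
```

Applied to myrelevantdata, the same tapply() call would give the p-values reported by shapiro_test() for the two groups.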

t.test(myrelevantdata$Total_Expenditure ~ myrelevantdata$Gender, paired = FALSE, var.equal = FALSE, alternative = "two.sided") # independent-samples t-test with the Welch correction, because the equal-variance assumption is often violated

## 
##  Welch Two Sample t-test
## 
## data:  myrelevantdata$Total_Expenditure by myrelevantdata$Gender
## t = 0.47228, df = 96.283, p-value = 0.6378
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -150.6440  244.7145
## sample estimates:
##   mean in group Male mean in group Female 
##             406.4583             359.4231

We cannot reject H0 (p = 0.64), i.e., we cannot reject that the true difference in means between Male and Female equals 0. However, since the normality assumption is violated, a non-parametric test is needed.
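The Welch result can be reproduced by hand from the group summaries. A minimal sketch, assuming the rounded means and standard errors printed by describeBy() and t.test() above (all variable names are illustrative); because the inputs are rounded, the results should agree with the reported t, df, p-value and confidence interval only approximately:

```r
# Group summaries copied from the output above (rounded values)
mean_m <- 406.4583; mean_f <- 359.4231   # group means
se_m   <- 73.61;    se_f   <- 67.08      # standard errors of the means
n_m    <- 48;       n_f    <- 52         # group sizes

# Standard error of the difference without assuming equal variances
se_diff <- sqrt(se_m^2 + se_f^2)

# Welch t statistic
t_stat <- (mean_m - mean_f) / se_diff

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se_m^2 + se_f^2)^2 /
  (se_m^4 / (n_m - 1) + se_f^4 / (n_f - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_value <- 2 * pt(-abs(t_stat), df_welch)
ci <- (mean_m - mean_f) + c(-1, 1) * qt(0.975, df_welch) * se_diff

round(c(t = t_stat, df = df_welch, p = p_value), 3)
round(ci, 1)
```

The confidence interval contains 0, which matches the decision not to reject H0.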

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cohens_d(myrelevantdata$Total_Expenditure ~ myrelevantdata$Gender, pooled_sd = FALSE) # Cohen's d with un-pooled SD, i.e. without assuming equal variances

## Cohen's d |        95% CI
## -------------------------
## 0.09      | [-0.30, 0.49]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(0.09, rules = "sawilowsky2009")

## [1] "tiny"
## (Rules: sawilowsky2009)

The effect size is tiny (d = 0.09). However, since the normality assumption is violated, a non-parametric test is needed.
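The value of d can be checked from the descriptive statistics. A minimal sketch, assuming that the un-pooled SD is the square root of the average of the two group variances and using the rounded means and SDs reported above (variable names are illustrative):

```r
# Group summaries copied from the describeBy() output (rounded values)
mean_m <- 406.4583; mean_f <- 359.4231
sd_m   <- 509.99;   sd_f   <- 483.73

# Un-pooled SD: root of the mean of the two group variances (assumed definition)
sd_unpooled <- sqrt((sd_m^2 + sd_f^2) / 2)

# Cohen's d: difference in means in units of that SD
d <- (mean_m - mean_f) / sd_unpooled
round(d, 2)
```

A difference of about 47 monetary units is small relative to a spread of roughly 500 monetary units, which is why d is so close to 0.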

wilcox.test(myrelevantdata$Total_Expenditure ~ myrelevantdata$Gender,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided") # non-parametric alternative, used because the normality assumption is violated and outliers are present

## 
##  Wilcoxon rank sum test
## 
## data:  myrelevantdata$Total_Expenditure by myrelevantdata$Gender
## W = 1381, p-value = 0.3566
## alternative hypothesis: true location shift is not equal to 0

We cannot reject H0 (p = 0.36), i.e., we cannot reject that the true location shift between the Male and Female distributions equals 0.

effectsize(wilcox.test(myrelevantdata$Total_Expenditure ~ myrelevantdata$Gender,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")) #checking effect size for Wilcoxon Rank Sum Test.

## r (rank biserial) |        95% CI
## ---------------------------------
## 0.11              | [-0.12, 0.32]

interpret_rank_biserial(0.11)

## [1] "small"
## (Rules: funder2019)

The effect size is small (r = 0.11).
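The rank-biserial correlation follows directly from the W statistic. A minimal sketch, assuming W is the statistic of the first group (Male) as printed by wilcox.test() above, and using the group sizes from describeBy() (variable names are illustrative):

```r
# Values copied from the output above
W   <- 1381   # Wilcoxon rank sum statistic for the first group (Male)
n_m <- 48     # number of male transactions
n_f <- 52     # number of female transactions

# W / (n_m * n_f) is the share of Male-Female pairs in which the male
# transaction is larger (ties counted as half); rescale it to [-1, 1]
r_rb <- 2 * W / (n_m * n_f) - 1
round(r_rb, 2)
```

A value near 0 means that a randomly chosen male transaction is about as likely to exceed a randomly chosen female transaction as the reverse.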

#install.packages("ggpubr")
library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.3.2

ggboxplot(myrelevantdata,
          x= "Gender",
          y= "Total_Expenditure",
          add = "jitter") # boxplot to check for outliers, which are present in both groups and support the use of a non-parametric test

Conclusions: Based on the sample data, we cannot conclude that men and women differ in total expenditure per transaction (Wilcoxon rank sum test, p = 0.36); the effect size is small (r = 0.11).

RakovecLana_homework_1

2024-01-09

Lana Rakovec
