RQ1: Is there a correlation between the customer’s age and their total spending per transaction?
mydata <- read.table("~/hw17/dataset.csv",
header = TRUE,
sep=",",
dec=".") #Creating the dataset
head(mydata)
## Transaction.ID Date Customer.ID Gender Age Product.Category Quantity
## 1 1 2023-11-24 CUST001 Male 34 Beauty 3
## 2 2 2023-02-27 CUST002 Female 26 Clothing 2
## 3 3 2023-01-13 CUST003 Male 50 Electronics 1
## 4 4 2023-05-21 CUST004 Male 37 Clothing 1
## 5 5 2023-05-06 CUST005 Male 30 Beauty 2
## 6 6 2023-04-25 CUST006 Female 45 Beauty 1
## Price.per.Unit Total.Amount
## 1 50 150
## 2 500 1000
## 3 30 30
## 4 500 500
## 5 50 100
## 6 30 30
Explanation of dataset: This dataset records retail transactions. Its variables are Transaction ID, Date, Customer ID, Gender, Age, Product Category, Quantity, Price per Unit, and Total Amount. They can be used to study sales trends, demographic influences, and purchasing behavior.
The data has 1000 observations of 9 variables.
The source of this dataset is Kaggle:
Ranitsarkar. (2023, March 4). Yulu_Analysis-Hypothesis testing. https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset
Unit of observation is a transaction (will be changed in cleaned dataset based on RQ).
The full dataset (1000 observations) is larger than needed and contains variables that are irrelevant to the research question, so I draw a random sample of 500 transactions and keep only the relevant variables. The cleaned data are explained below.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(1)
sampled_data <- mydata %>% sample_n(size = 500, replace = FALSE) #random sampling the data to a sample with 500 units.
head(sampled_data)
## Transaction.ID Date Customer.ID Gender Age Product.Category Quantity
## 1 836 2023-04-19 CUST836 Female 22 Clothing 1
## 2 679 2023-01-11 CUST679 Female 18 Beauty 3
## 3 129 2023-04-23 CUST129 Female 21 Beauty 2
## 4 930 2023-05-10 CUST930 Male 54 Clothing 4
## 5 509 2023-06-26 CUST509 Female 37 Electronics 3
## 6 471 2023-03-23 CUST471 Male 32 Clothing 3
## Price.per.Unit Total.Amount
## 1 50 50
## 2 30 90
## 3 300 600
## 4 50 200
## 5 300 900
## 6 50 150
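The role of set.seed() in the sampling step can be illustrated with a minimal base-R sketch. The data frame `toy` and its columns are hypothetical stand-ins, not the retail dataset, and base `sample()` is used here instead of `sample_n()` so that the example needs no packages.

```r
# Hypothetical toy data standing in for the full dataset
toy <- data.frame(id = 1:20, value = (1:20)^2)

# Draw 5 rows without replacement, twice, resetting the seed each time
set.seed(1)
first_draw <- toy[sample(nrow(toy), size = 5, replace = FALSE), ]
set.seed(1)
second_draw <- toy[sample(nrow(toy), size = 5, replace = FALSE), ]

# Same seed, same rows: the sample can be reproduced exactly
print(identical(first_draw, second_draw))
```

Without the seed, each run would draw a different sample and every later result would change slightly.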
myrelevantdata <- sampled_data[, -c(2,3,4,6,7,8)] #removing the irrelevant variables
colnames(myrelevantdata) <- c("Transaction_ID", "Age", "Total_Expenditure") #renaming the columns
head(myrelevantdata)
## Transaction_ID Age Total_Expenditure
## 1 836 22 50
## 2 679 18 90
## 3 129 21 600
## 4 930 54 200
## 5 509 37 900
## 6 471 32 150
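Dropping columns by position works, but it silently breaks if the column order ever changes. As a hedged alternative, the same three variables could be kept by name. The sketch below uses a two-row stand-in (`toy`, a hypothetical name) built from the first two sampled rows shown above.

```r
# Two-row stand-in with the same column names as the sampled data
toy <- data.frame(
  Transaction.ID   = c(836, 679),
  Date             = c("2023-04-19", "2023-01-11"),
  Customer.ID      = c("CUST836", "CUST679"),
  Gender           = c("Female", "Female"),
  Age              = c(22, 18),
  Product.Category = c("Clothing", "Beauty"),
  Quantity         = c(1, 3),
  Price.per.Unit   = c(50, 30),
  Total.Amount     = c(50, 90)
)

# Keep columns by name rather than dropping them by position
by_name <- toy[, c("Transaction.ID", "Age", "Total.Amount")]
by_position <- toy[, -c(2, 3, 4, 6, 7, 8)]

new_names <- c("Transaction_ID", "Age", "Total_Expenditure")
colnames(by_name) <- new_names
colnames(by_position) <- new_names

# Both approaches give the same result on this layout
print(identical(by_name, by_position))
```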
Based on my RQ, the unit of observation is a transaction, and the variables are age and total expenditure per purchase.
H0: The population correlation equals 0.
H1: The population correlation does not equal 0.
summary(myrelevantdata)
## Transaction_ID Age Total_Expenditure
## Min. : 1.0 Min. :18.00 Min. : 25.00
## 1st Qu.:270.8 1st Qu.:29.00 1st Qu.: 71.25
## Median :510.5 Median :42.00 Median : 120.00
## Mean :510.3 Mean :41.31 Mean : 445.43
## 3rd Qu.:764.2 3rd Qu.:54.00 3rd Qu.: 900.00
## Max. :998.0 Max. :64.00 Max. :2000.00
Statistical description:
The average age of the customers is 41.31 years, with the youngest customer being 18 and the oldest 64 years old. 25% of the customers were 29 years old or younger, and 75% were 54 years old or younger, so the oldest quarter of customers were between 54 and 64 years old.
Total expenditure ranges from 25 to 2000 monetary units per transaction, with an average transaction value of 445.43 monetary units. 25% of transactions were worth 71.25 monetary units or less, and 75% were worth 900 monetary units or less, so the largest quarter of transactions were between 900 and 2000 monetary units.
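How a quartile is read can be checked with a small sketch. The vector `ages` is hypothetical and chosen only so that the quartiles fall on round values; it is not the sampled data.

```r
# Hypothetical ages, sorted for readability
ages <- c(18, 22, 29, 35, 42, 47, 54, 60, 64)

# First and third quartile (R's default quantile type)
q <- quantile(ages, probs = c(0.25, 0.75))
print(q)

# Share of observations at or below each quartile:
# roughly 25% lie at or below Q1 and roughly 75% at or below Q3
share_below_q1 <- mean(ages <= q[["25%"]])
share_below_q3 <- mean(ages <= q[["75%"]])
print(c(share_below_q1, share_below_q3))
```

The third quartile is therefore an upper bound for three quarters of the observations, not a lower bound.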
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(myrelevantdata[ , -1], smooth=FALSE)
We need to check the assumptions:
library(car)
scatterplot(myrelevantdata$Age, myrelevantdata$Total_Expenditure,
smooth = TRUE,
boxplot = FALSE,
main = "Relationship between the Age and Total expenditure",
xlab = "Age",
ylab = "Total Expenditure")
The linearity assumption is not met: the scatterplot shows no clear linear relationship between age and total expenditure.
shapiro.test(myrelevantdata$Age)
##
## Shapiro-Wilk normality test
##
## data: myrelevantdata$Age
## W = 0.94746, p-value = 2.523e-12
H0: Variable Age is normally distributed.
H1: Variable Age is not normally distributed.
We reject H0 (p < 0.001) and conclude that Age is not normally distributed.
shapiro.test(myrelevantdata$Total_Expenditure)
##
## Shapiro-Wilk normality test
##
## data: myrelevantdata$Total_Expenditure
## W = 0.73516, p-value < 2.2e-16
H0: Variable Total Expenditure is normally distributed.
H1: Variable Total Expenditure is not normally distributed.
We reject H0 (p < 0.001) and conclude that Total Expenditure is not normally distributed.
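The decision rule used for both Shapiro-Wilk tests can be written out explicitly. This is a minimal sketch on simulated data: the variable `skewed` and the 5% threshold `alpha` are assumptions made for illustration.

```r
# Simulated right-skewed data (exponential), clearly non-normal
set.seed(42)
skewed <- rexp(200)

res <- shapiro.test(skewed)
alpha <- 0.05  # assumed significance level

# A small p-value means the data are unlikely under normality, so H0 is rejected
decision <- if (res$p.value < alpha) {
  "reject H0: not normally distributed"
} else {
  "cannot reject H0"
}
print(decision)
```

Note the direction of the inequality: H0 is rejected when the p-value is below the threshold, not above it.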
Neither distribution is normal, so I will use the Spearman correlation test. The Pearson coefficient is shown first for comparison only.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(myrelevantdata[ , -1]),
type = "pearson")
## Age Total_Expenditure
## Age 1.00 -0.07
## Total_Expenditure -0.07 1.00
##
## n= 500
##
##
## P
## Age Total_Expenditure
## Age 0.113
## Total_Expenditure 0.113
We need to perform the Spearman test, because neither of the variables is normally distributed:
library(Hmisc)
rcorr(as.matrix(myrelevantdata[ , -1]),
type = "spearman")
## Age Total_Expenditure
## Age 1.00 -0.03
## Total_Expenditure -0.03 1.00
##
## n= 500
##
##
## P
## Age Total_Expenditure
## Age 0.5445
## Total_Expenditure 0.5445
cor(myrelevantdata$Age, myrelevantdata$Total_Expenditure,
method = "spearman",
use = "complete.obs")
## [1] -0.02716458
cor.test(myrelevantdata$Age, myrelevantdata$Total_Expenditure,
method = "spearman") #cor.test() has no 'use' argument, so it is omitted here
## Warning in cor.test.default(myrelevantdata$Age,
## myrelevantdata$Total_Expenditure, : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: myrelevantdata$Age and myrelevantdata$Total_Expenditure
## S = 21399177, p-value = 0.5445
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.02716458
H0: The population correlation coefficient between Age and Total expenditure is 0.
H1: The population correlation coefficient between Age and Total expenditure is not 0.
We cannot reject H0 (rho = -0.03, p = 0.54). We cannot conclude that Age and Total expenditure are correlated.
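Spearman's rho is the Pearson correlation computed on ranks, which is why it does not require normality. A minimal sketch, using the ages and total amounts from the first six rows of the raw data shown at the top (the names `x` and `y` are illustrative):

```r
# First six rows of the raw data: age and total amount
x <- c(34, 26, 50, 37, 30, 45)
y <- c(150, 1000, 30, 500, 100, 30)

# Spearman's rho computed directly
rho_direct <- cor(x, y, method = "spearman")

# The same value obtained as the Pearson correlation of the ranks
# (ties receive average ranks in both cases)
rho_ranks <- cor(rank(x), rank(y), method = "pearson")

print(all.equal(rho_direct, rho_ranks))
```

The tied values in `y` are also the reason cor.test() warns that it cannot compute an exact p-value.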
Product Category is the category of the purchased product (e.g., Electronics, Clothing, Beauty). Below I test whether it is associated with the customer's gender.
mysecondrelevantdata <- sampled_data[, -c(2,3,5,7,8,9)] #removing the irrelevant variables
colnames(mysecondrelevantdata) <- c("Transaction_ID", "Gender", "Category") #renaming the columns
mysecondrelevantdata$Gender <- factor(mysecondrelevantdata$Gender, levels = c("Male", "Female"), labels = c("Male", "Female"))
mysecondrelevantdata$Category <- factor(mysecondrelevantdata$Category, levels = c("Clothing", "Beauty", "Electronics"), labels = c("Clothing", "Beauty", "Electronics"))
head(mysecondrelevantdata)
## Transaction_ID Gender Category
## 1 836 Female Clothing
## 2 679 Female Beauty
## 3 129 Female Beauty
## 4 930 Male Clothing
## 5 509 Female Electronics
## 6 471 Male Clothing
summary(mysecondrelevantdata)
## Transaction_ID Gender Category
## Min. : 1.0 Male :238 Clothing :175
## 1st Qu.:270.8 Female:262 Beauty :162
## Median :510.5 Electronics:163
## Mean :510.3
## 3rd Qu.:764.2
## Max. :998.0
results <- chisq.test(mysecondrelevantdata$Category, mysecondrelevantdata$Gender,
correct = FALSE)
results
##
## Pearson's Chi-squared test
##
## data: mysecondrelevantdata$Category and mysecondrelevantdata$Gender
## X-squared = 2.0934, df = 2, p-value = 0.3511
H0: There is no association between gender and category of purchase.
H1: There is an association between gender and category of purchase.
We cannot reject H0 (p = 0.35). We cannot conclude that gender and category of purchase are associated.
addmargins(results$observed)
## mysecondrelevantdata$Gender
## mysecondrelevantdata$Category Male Female Sum
## Clothing 91 84 175
## Beauty 73 89 162
## Electronics 74 89 163
## Sum 238 262 500
Here we can see the observed frequencies.
round(results$expected, 2)
## mysecondrelevantdata$Gender
## mysecondrelevantdata$Category Male Female
## Clothing 83.30 91.70
## Beauty 77.11 84.89
## Electronics 77.59 85.41
Here we can see the expected frequencies, rounded to two decimals.
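The expected frequencies and the test statistic can be reproduced by hand from the observed table. The sketch below re-enters the observed counts shown above as a matrix (the object names are illustrative) and applies the usual formulas: expected = row total × column total / n, and X-squared = sum of (observed − expected)² / expected.

```r
# Observed counts copied from the table above
observed <- matrix(c(91, 84,
                     73, 89,
                     74, 89),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Category = c("Clothing", "Beauty", "Electronics"),
                                   Gender = c("Male", "Female")))
n <- sum(observed)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic, degrees of freedom and p-value
chi_sq <- sum((observed - expected)^2 / expected)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

print(round(expected, 2))
print(round(c(chi_sq = chi_sq, p_value = p_value), 4))
```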
round(results$residuals, 2) #Pearson residuals; results$res only worked through partial matching
## mysecondrelevantdata$Gender
## mysecondrelevantdata$Category Male Female
## Clothing 0.84 -0.80
## Beauty -0.47 0.45
## Electronics -0.41 0.39
None of the differences between the observed and expected frequencies is statistically significant, because all residuals shown (Pearson residuals) are below 1.96 in absolute value (where alpha is 5%).
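chisq.test() returns two kinds of residuals: Pearson residuals (`$residuals`) and standardized residuals (`$stdres`). The comparison with 1.96 is strictly meant for the standardized ones. A hedged sketch that re-enters the observed counts from above and checks both:

```r
# Observed counts copied from the table above
observed <- matrix(c(91, 84,
                     73, 89,
                     74, 89),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Category = c("Clothing", "Beauty", "Electronics"),
                                   Gender = c("Male", "Female")))

fit <- chisq.test(observed, correct = FALSE)

# Pearson residuals: (observed - expected) / sqrt(expected)
print(round(fit$residuals, 2))

# Standardized residuals: additionally adjusted for the row and column margins
print(round(fit$stdres, 2))

# Compare the largest standardized residual with the 5% two-sided critical value
all_below <- max(abs(fit$stdres)) < qnorm(0.975)
print(all_below)
```

The standardized residuals are larger in absolute value than the Pearson residuals, but on these counts they also stay below the critical value, so the conclusion is unchanged.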
addmargins(round(prop.table(results$observed), 3))
## mysecondrelevantdata$Gender
## mysecondrelevantdata$Category Male Female Sum
## Clothing 0.182 0.168 0.350
## Beauty 0.146 0.178 0.324
## Electronics 0.148 0.178 0.326
## Sum 0.476 0.524 1.000
Explanation of a random value: Out of all 500 customers, 17.8% were female and bought “Beauty” category products in their purchase.
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mysecondrelevantdata$Gender
## mysecondrelevantdata$Category Male Female Sum
## Clothing 0.520 0.480 1.000
## Beauty 0.451 0.549 1.000
## Electronics 0.454 0.546 1.000
Explanation of a random value: Out of all customers that bought “Beauty” category products in their purchase, 54.9% were female.
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mysecondrelevantdata$Gender
## mysecondrelevantdata$Category Male Female
## Clothing 0.382 0.321
## Beauty 0.307 0.340
## Electronics 0.311 0.340
## Sum 1.000 1.001
Explanation of a random value: Out of all female customers, 34.0% bought “Beauty” category products in their purchase.
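The column sum of 1.001 for female customers is a rounding artifact: the proportions are rounded before the margin is added. A small sketch, re-entering the observed counts from above, shows that rounding after adding the margin avoids it:

```r
# Observed counts copied from the table above
observed <- as.table(matrix(c(91, 84,
                              73, 89,
                              74, 89),
                            nrow = 3, byrow = TRUE,
                            dimnames = list(Category = c("Clothing", "Beauty", "Electronics"),
                                            Gender = c("Male", "Female"))))

# Proportions within each gender (columns sum to 1)
col_props <- prop.table(observed, margin = 2)

# Rounding first, then summing, accumulates rounding error
rounded_first <- addmargins(round(col_props, 3), 1)

# Summing first, then rounding, keeps the column totals at exactly 1
rounded_last <- round(addmargins(col_props, 1), 3)

print(rounded_first["Sum", ])
print(rounded_last["Sum", ])
```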
library(effectsize)
effectsize::cramers_v(mysecondrelevantdata$Category, mysecondrelevantdata$Gender)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.01 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.01)
## [1] "tiny"
## (Rules: funder2019)
We found no evidence of an association between gender and category of purchase, and the effect size (Cramér's V = 0.01) is tiny.
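The reported value is an adjusted Cramér's V. As a rough check, it can be reproduced in base R from the observed counts. The sketch below assumes that the "adj." value corresponds to the usual small-sample bias correction (subtracting (r − 1)(c − 1)/(n − 1) from chi-squared/n and shrinking the table dimensions accordingly); this is an assumption about the package's default, not something the output itself confirms.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(91, 84,
                     73, 89,
                     74, 89),
                   nrow = 3, byrow = TRUE)
n <- sum(observed)
r <- nrow(observed)
k <- ncol(observed)

chi_sq <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Plain Cramer's V: sqrt(chi-squared / (n * (min(r, k) - 1)))
v_plain <- sqrt(chi_sq / (n * (min(r, k) - 1)))

# Bias-corrected version (assumed to match the "adj." value above)
phi2_adj <- max(0, chi_sq / n - (r - 1) * (k - 1) / (n - 1))
r_adj <- r - (r - 1)^2 / (n - 1)
k_adj <- k - (k - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

print(round(c(plain = v_plain, adjusted = v_adj), 2))
```

Both versions are very small on these counts, so the interpretation does not depend on which one is reported.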