Lana Rakovec

Homework 2

RQ1: Is there a correlation between the customer’s age and their total spending per transaction?

mydata <- read.table("~/hw17/dataset.csv",
                    header = TRUE,
                    sep=",",
                    dec=".") #Creating the dataset
head(mydata)
##   Transaction.ID       Date Customer.ID Gender Age Product.Category Quantity
## 1              1 2023-11-24     CUST001   Male  34           Beauty        3
## 2              2 2023-02-27     CUST002 Female  26         Clothing        2
## 3              3 2023-01-13     CUST003   Male  50      Electronics        1
## 4              4 2023-05-21     CUST004   Male  37         Clothing        1
## 5              5 2023-05-06     CUST005   Male  30           Beauty        2
## 6              6 2023-04-25     CUST006 Female  45           Beauty        1
##   Price.per.Unit Total.Amount
## 1             50          150
## 2            500         1000
## 3             30           30
## 4            500          500
## 5             50          100
## 6             30           30

Explanation of dataset: The data contains recorded retail transactions with the most important drivers of retail operations: Transaction ID, Date, Customer ID, Gender, Age, Product Category, Quantity, Price per Unit, and Total Amount. These variables help with analysing sales trends, demographic influences, and purchasing behavior.

The data has 1000 observations of 9 variables (the ones listed above).

The dataset comes from the Kaggle website:

Ranitsarkar. (2023, March 4). Yulu_Analysis-Hypothesis testing. https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset

The unit of observation is a transaction (this may change in the cleaned dataset, depending on the RQ).

The dataset is large (1000 observations) and contains variables that are irrelevant to the research question, so I draw a random sample, keep only the relevant variables, and explain the cleaned data below.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(1)
sampled_data <- mydata %>% sample_n(size = 500, replace = FALSE) # drawing a simple random sample of 500 units without replacement
head(sampled_data)
##   Transaction.ID       Date Customer.ID Gender Age Product.Category Quantity
## 1            836 2023-04-19     CUST836 Female  22         Clothing        1
## 2            679 2023-01-11     CUST679 Female  18           Beauty        3
## 3            129 2023-04-23     CUST129 Female  21           Beauty        2
## 4            930 2023-05-10     CUST930   Male  54         Clothing        4
## 5            509 2023-06-26     CUST509 Female  37      Electronics        3
## 6            471 2023-03-23     CUST471   Male  32         Clothing        3
##   Price.per.Unit Total.Amount
## 1             50           50
## 2             30           90
## 3            300          600
## 4             50          200
## 5            300          900
## 6             50          150
myrelevantdata <- sampled_data[, -c(2,3,4,6,7,8)] #removing the irrelevant variables
colnames(myrelevantdata) <- c("Transaction_ID", "Age", "Total_Expenditure") #renaming the columns
head(myrelevantdata)
##   Transaction_ID Age Total_Expenditure
## 1            836  22                50
## 2            679  18                90
## 3            129  21               600
## 4            930  54               200
## 5            509  37               900
## 6            471  32               150
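
Removing columns by position works, but it silently breaks if the column order ever changes. A minimal sketch of the same step done by column name is shown below; the `toy` data frame and its values are made up for illustration and only mimic the column names of the imported dataset.

```r
# Toy data frame with the same column names as the imported dataset
# (the values are invented for illustration only).
toy <- data.frame(
  Transaction.ID   = 1:3,
  Date             = c("2023-01-01", "2023-01-02", "2023-01-03"),
  Customer.ID      = c("CUST001", "CUST002", "CUST003"),
  Gender           = c("Male", "Female", "Male"),
  Age              = c(34, 26, 50),
  Product.Category = c("Beauty", "Clothing", "Electronics"),
  Quantity         = c(3, 2, 1),
  Price.per.Unit   = c(50, 500, 30),
  Total.Amount     = c(150, 1000, 30)
)

# Select the relevant variables by name, then rename them.
by_name <- toy[, c("Transaction.ID", "Age", "Total.Amount")]
colnames(by_name) <- c("Transaction_ID", "Age", "Total_Expenditure")

# The positional version used above, for comparison.
by_position <- toy[, -c(2, 3, 4, 6, 7, 8)]
colnames(by_position) <- c("Transaction_ID", "Age", "Total_Expenditure")

# Both approaches should give the same data frame.
identical(by_name, by_position)
```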

Based on my RQ, the unit of observation is a transaction, and the variables are age and total expenditure per purchase. H0: The population correlation coefficient equals 0. H1: The population correlation coefficient does not equal 0.

summary(myrelevantdata)
##  Transaction_ID       Age        Total_Expenditure
##  Min.   :  1.0   Min.   :18.00   Min.   :  25.00  
##  1st Qu.:270.8   1st Qu.:29.00   1st Qu.:  71.25  
##  Median :510.5   Median :42.00   Median : 120.00  
##  Mean   :510.3   Mean   :41.31   Mean   : 445.43  
##  3rd Qu.:764.2   3rd Qu.:54.00   3rd Qu.: 900.00  
##  Max.   :998.0   Max.   :64.00   Max.   :2000.00

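Statistical description: The average customer age in the sample is 41.31 years. Half of the transactions have a total expenditure of 120 or less, while the other half have a higher total expenditure.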

RQ1: Is there a correlation between the customer’s age and their total spending per transaction?

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(myrelevantdata[ , -1], smooth=FALSE)

We need to check the assumptions:

library(car)
scatterplot(myrelevantdata$Age, myrelevantdata$Total_Expenditure,
            smooth = TRUE,
            boxplot = FALSE,
            main = "Relationship between the Age and Total expenditure",
            xlab = "Age",
            ylab = "Total Expenditure")  

The linearity assumption is violated: the scatterplot shows no visible linear relationship between age and total expenditure.

shapiro.test(myrelevantdata$Age)
## 
##  Shapiro-Wilk normality test
## 
## data:  myrelevantdata$Age
## W = 0.94746, p-value = 2.523e-12

H0: Variable Age is normally distributed. H1: Variable Age is not normally distributed.

We can reject H0 at p < 0.001 and conclude that Age is not normally distributed.

shapiro.test(myrelevantdata$Total_Expenditure)
## 
##  Shapiro-Wilk normality test
## 
## data:  myrelevantdata$Total_Expenditure
## W = 0.73516, p-value < 2.2e-16

H0: Variable Total Expenditure is normally distributed. H1: Variable Total Expenditure is not normally distributed.

We can reject H0 at p < 0.001 and conclude that Total Expenditure is not normally distributed.

Neither of the distributions is normal, so I will use the Spearman correlation test.
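
The decision rule used for both Shapiro-Wilk tests can be sketched on simulated data. This is only an illustration: the `skewed` variable is hypothetical (drawn from an exponential distribution), and the seed is arbitrary.

```r
set.seed(1)               # arbitrary seed, only for reproducibility
skewed <- rexp(200)       # simulated right-skewed variable (hypothetical data)

test  <- shapiro.test(skewed)
alpha <- 0.001

# H0: the variable is normally distributed.
# We reject H0 when the p-value is below the chosen alpha.
reject_normality <- test$p.value < alpha
reject_normality
```

When normality is rejected for at least one of the two variables, a rank-based coefficient such as Spearman's rho is the safer choice.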

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(myrelevantdata[ , -1]), 
      type = "pearson")
##                     Age Total_Expenditure
## Age                1.00             -0.07
## Total_Expenditure -0.07              1.00
## 
## n= 500 
## 
## 
## P
##                   Age   Total_Expenditure
## Age                     0.113            
## Total_Expenditure 0.113

We need to perform the Spearman test, because neither of the variables is normally distributed:

library(Hmisc)
rcorr(as.matrix(myrelevantdata[ , -1]), 
      type = "spearman")
##                     Age Total_Expenditure
## Age                1.00             -0.03
## Total_Expenditure -0.03              1.00
## 
## n= 500 
## 
## 
## P
##                   Age    Total_Expenditure
## Age                      0.5445           
## Total_Expenditure 0.5445
cor(myrelevantdata$Age, myrelevantdata$Total_Expenditure,
    method = "spearman",
    use = "complete.obs")
## [1] -0.02716458
cor.test(myrelevantdata$Age, myrelevantdata$Total_Expenditure,
    method = "spearman") # cor.test() drops incomplete pairs itself, so no 'use' argument is needed
## Warning in cor.test.default(myrelevantdata$Age,
## myrelevantdata$Total_Expenditure, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  myrelevantdata$Age and myrelevantdata$Total_Expenditure
## S = 21399177, p-value = 0.5445
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.02716458

H0: Population correlation coefficient between Age and Total expenditure is 0. H1: Population correlation coefficient between Age and Total expenditure is not 0.

We cannot reject H0 (p = 0.54). We cannot conclude that Age and Total expenditure are correlated.
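
To make clear what the Spearman coefficient measures, here is a small sketch showing that it is the Pearson correlation computed on the ranks of the data. It uses only the first six rows of the sample printed earlier, so the resulting value is not the coefficient reported for the full sample.

```r
# First six rows of the cleaned sample, copied from the head() output
age   <- c(22, 18, 21, 54, 37, 32)
spend <- c(50, 90, 600, 200, 900, 150)

# Spearman's rho computed directly ...
rho_direct <- cor(age, spend, method = "spearman")

# ... and as Pearson's r on the ranks (ties would receive average ranks)
rho_ranks <- cor(rank(age), rank(spend), method = "pearson")

all.equal(rho_direct, rho_ranks)
```

Because only the ordering of the values matters, the coefficient is not affected by the strong right skew of Total expenditure.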

RQ2: Is there an association between gender and category of purchase?

Product Category is the category of the purchased product (e.g., Electronics, Clothing, Beauty)

mysecondrelevantdata <- sampled_data[, -c(2,3,5,7,8,9)] #removing the irrelevant variables
colnames(mysecondrelevantdata) <- c("Transaction_ID", "Gender", "Category") #renaming the columns
mysecondrelevantdata$Gender <- factor(mysecondrelevantdata$Gender, levels = c("Male", "Female"), labels = c("Male", "Female"))
mysecondrelevantdata$Category <- factor(mysecondrelevantdata$Category, levels = c("Clothing", "Beauty", "Electronics"), labels = c("Clothing", "Beauty", "Electronics"))
head(mysecondrelevantdata)
##   Transaction_ID Gender    Category
## 1            836 Female    Clothing
## 2            679 Female      Beauty
## 3            129 Female      Beauty
## 4            930   Male    Clothing
## 5            509 Female Electronics
## 6            471   Male    Clothing
summary(mysecondrelevantdata)
##  Transaction_ID     Gender           Category  
##  Min.   :  1.0   Male  :238   Clothing   :175  
##  1st Qu.:270.8   Female:262   Beauty     :162  
##  Median :510.5                Electronics:163  
##  Mean   :510.3                                 
##  3rd Qu.:764.2                                 
##  Max.   :998.0
results <- chisq.test(mysecondrelevantdata$Category, mysecondrelevantdata$Gender, 
                      correct = FALSE)
results
## 
##  Pearson's Chi-squared test
## 
## data:  mysecondrelevantdata$Category and mysecondrelevantdata$Gender
## X-squared = 2.0934, df = 2, p-value = 0.3511

H0: There is no association between gender and category of purchase. H1: There is an association between gender and category of purchase.

We cannot reject H0 (p = 0.35). We cannot conclude that there is an association between gender and the category of purchase.

addmargins(results$observed)
##                              mysecondrelevantdata$Gender
## mysecondrelevantdata$Category Male Female Sum
##                   Clothing      91     84 175
##                   Beauty        73     89 162
##                   Electronics   74     89 163
##                   Sum          238    262 500

Here we can see the observed frequencies.

round(results$expected, 2)
##                              mysecondrelevantdata$Gender
## mysecondrelevantdata$Category  Male Female
##                   Clothing    83.30  91.70
##                   Beauty      77.11  84.89
##                   Electronics 77.59  85.41

Here we can see the expected frequencies, rounded to two decimals.

round(results$residuals, 2) # Pearson residuals: (observed - expected) / sqrt(expected)
##                              mysecondrelevantdata$Gender
## mysecondrelevantdata$Category  Male Female
##                   Clothing     0.84  -0.80
##                   Beauty      -0.47   0.45
##                   Electronics -0.41   0.39

The differences between the observed and expected frequencies are not statistically significant in any cell, because all standardized (Pearson) residuals are below 1.96 in absolute value (where alpha is 5%).
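
The expected frequencies, the residuals, and the test statistic can also be reproduced by hand from the observed table. The sketch below retypes the observed frequencies from the output above into a matrix (the object names are my own) and applies the usual formulas.

```r
# Observed frequencies, copied from the table above
observed <- matrix(c(91, 84,
                     73, 89,
                     74, 89),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Category = c("Clothing", "Beauty", "Electronics"),
                                   Gender   = c("Male", "Female")))
n <- sum(observed)

# Expected frequency under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson residual for each cell: (observed - expected) / sqrt(expected)
pearson_res <- (observed - expected) / sqrt(expected)

# The chi-squared statistic is the sum of the squared Pearson residuals
chi_sq <- sum(pearson_res^2)

round(expected, 2)
round(pearson_res, 2)
round(chi_sq, 4)
```

For example, the expected frequency for male customers buying Clothing is 175 * 238 / 500 = 83.30.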

addmargins(round(prop.table(results$observed), 3))
##                              mysecondrelevantdata$Gender
## mysecondrelevantdata$Category  Male Female   Sum
##                   Clothing    0.182  0.168 0.350
##                   Beauty      0.146  0.178 0.324
##                   Electronics 0.148  0.178 0.326
##                   Sum         0.476  0.524 1.000

Explanation of a random value: Out of all 500 customers, 17.8% were female and bought “Beauty” category products.

addmargins(round(prop.table(results$observed, 1), 3), 2) 
##                              mysecondrelevantdata$Gender
## mysecondrelevantdata$Category  Male Female   Sum
##                   Clothing    0.520  0.480 1.000
##                   Beauty      0.451  0.549 1.000
##                   Electronics 0.454  0.546 1.000

Explanation of a random value: Out of all customers who bought “Beauty” category products, 54.9% were female.

addmargins(round(prop.table(results$observed, 2), 3), 1) 
##                              mysecondrelevantdata$Gender
## mysecondrelevantdata$Category  Male Female
##                   Clothing    0.382  0.321
##                   Beauty      0.307  0.340
##                   Electronics 0.311  0.340
##                   Sum         1.000  1.001

Explanation of a random value: Out of all female customers, 34% bought “Beauty” category products.
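
The column sum of 1.001 for female customers in the last table is a rounding artifact: the proportions were rounded before the margin was added. A small sketch of the difference, using the observed frequencies retyped from the table above (object names are my own):

```r
# Observed frequencies, copied from the table above
observed <- as.table(matrix(c(91, 84,
                              73, 89,
                              74, 89),
                            nrow = 3, byrow = TRUE,
                            dimnames = list(Category = c("Clothing", "Beauty", "Electronics"),
                                            Gender   = c("Male", "Female"))))

col_props <- prop.table(observed, 2)   # proportions within each gender

# Rounding first and summing afterwards (as in the output above)
rounded_first <- addmargins(round(col_props, 3), 1)

# Summing first and rounding afterwards
summed_first <- round(addmargins(col_props, 1), 3)

rounded_first["Sum", "Female"]
summed_first["Sum", "Female"]
```

Summing the unrounded proportions first keeps every column total at exactly 1.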

library(effectsize)
effectsize::cramers_v(mysecondrelevantdata$Category, mysecondrelevantdata$Gender)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.01              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.01)
## [1] "tiny"
## (Rules: funder2019)

The effect size is tiny (Cramér's V = 0.01), which is consistent with the chi-squared test: we find no evidence of an association between gender and category of purchase.
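
For reference, Cramér's V can be computed by hand from the chi-squared statistic. The sketch below shows both the unadjusted coefficient and a small-sample bias-corrected version; I am assuming that the "(adj.)" in the output above refers to this kind of bias correction, which I have not verified against the package source. The object names are my own.

```r
# Observed frequencies, copied from the contingency table above
observed <- matrix(c(91, 84,
                     73, 89,
                     74, 89),
                   nrow = 3, byrow = TRUE)
n <- sum(observed)
r <- nrow(observed)
k <- ncol(observed)

chi_sq <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Unadjusted Cramer's V: sqrt(chi^2 / (n * min(r - 1, k - 1)))
v <- sqrt(chi_sq / (n * min(r - 1, k - 1)))

# Bias-corrected version (assumed to correspond to the "adj." output)
phi2_adj <- max(0, chi_sq / n - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

round(c(V = v, V_adj = v_adj), 2)
```

Either version leads to the same interpretation here, since both values are far below the usual thresholds for a small effect.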