Homework 2

Lana Rakovec


RQ1: Is there a correlation between the customer’s age and their total spending per transaction?

mydata <- read.table("~/hw17/dataset.csv",
                     header = TRUE,
                     sep = ",",
                     dec = ".") # importing the dataset
head(mydata)

##   Transaction.ID       Date Customer.ID Gender Age Product.Category Quantity
## 1              1 2023-11-24     CUST001   Male  34           Beauty        3
## 2              2 2023-02-27     CUST002 Female  26         Clothing        2
## 3              3 2023-01-13     CUST003   Male  50      Electronics        1
## 4              4 2023-05-21     CUST004   Male  37         Clothing        1
## 5              5 2023-05-06     CUST005   Male  30           Beauty        2
## 6              6 2023-04-25     CUST006 Female  45           Beauty        1
##   Price.per.Unit Total.Amount
## 1             50          150
## 2            500         1000
## 3             30           30
## 4            500          500
## 5             50          100
## 6             30           30

Explanation of dataset: The data contains retail transaction records covering the main drivers of retail operations: Transaction ID, Date, Customer ID, Gender, Age, Product Category, Quantity, Price per Unit, and Total Amount. These variables help in analyzing sales trends, demographic influences, and purchasing behavior.

The data has 1000 observations with 9 variables, where:

Transaction ID is a unique identifier for each transaction, allowing tracking and reference,
Date is the date when the transaction occurred, providing insights into sales trends over time,
Customer ID is a unique identifier for each customer, enabling customer-centric analysis,
Gender is the gender of the customer (Male/Female), offering insights into gender-based purchasing patterns,
Age is the age of the customer, measured in years, which allows segmentation and exploration of age-related influences,
Product Category is the category of the purchased product (e.g., Electronics, Clothing, Beauty), helping to understand product preferences,
Quantity is the number of units of the product purchased, contributing to insights on purchase volumes,
Price per Unit is the price of one unit of the product, measured in monetary units, which is used to calculate total spending,
Total Amount is the total monetary value of the transaction, measured in monetary units, which shows the financial impact of each purchase.

The dataset comes from Kaggle:

Ranitsarkar. (2023, March 4). Yulu_Analysis-Hypothesis testing. https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset

The unit of observation is a transaction.

The dataset is large (1000 observations) and contains variables that are irrelevant to the research questions, so I draw a random sample of 500 transactions, keep only the relevant variables, and describe the cleaned data below.

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

set.seed(1)
sampled_data <- mydata %>% sample_n(size = 500, replace = FALSE) # simple random sample of 500 transactions, drawn without replacement

head(sampled_data)

##   Transaction.ID       Date Customer.ID Gender Age Product.Category Quantity
## 1            836 2023-04-19     CUST836 Female  22         Clothing        1
## 2            679 2023-01-11     CUST679 Female  18           Beauty        3
## 3            129 2023-04-23     CUST129 Female  21           Beauty        2
## 4            930 2023-05-10     CUST930   Male  54         Clothing        4
## 5            509 2023-06-26     CUST509 Female  37      Electronics        3
## 6            471 2023-03-23     CUST471   Male  32         Clothing        3
##   Price.per.Unit Total.Amount
## 1             50           50
## 2             30           90
## 3            300          600
## 4             50          200
## 5            300          900
## 6             50          150
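The role of set.seed() in the sampling step can be illustrated with a small base-R sketch. The vector ids below is a hypothetical stand-in for the 1000 transaction IDs, not the actual data, and base sample() is used instead of sample_n() so the example has no package dependencies. The only point is that repeating the seed reproduces the same sample.

```r
# Hypothetical stand-in for the 1000 transaction IDs (not the real dataset)
ids <- 1:1000

# Same seed, same draw: the sample is reproducible
set.seed(1)
first_draw <- sample(ids, size = 500, replace = FALSE)
set.seed(1)
second_draw <- sample(ids, size = 500, replace = FALSE)

identical(first_draw, second_draw)  # TRUE when the seed is reset before each draw
anyDuplicated(first_draw) == 0      # TRUE: no ID is drawn twice without replacement
```

In current dplyr releases sample_n() is superseded by slice_sample(n = 500), which could be used in the same place.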

myrelevantdata <- sampled_data[, c("Transaction.ID", "Age", "Total.Amount")] # keeping only the variables relevant to RQ1
colnames(myrelevantdata) <- c("Transaction_ID", "Age", "Total_Expenditure") # renaming the columns
head(myrelevantdata)

##   Transaction_ID Age Total_Expenditure
## 1            836  22                50
## 2            679  18                90
## 3            129  21               600
## 4            930  54               200
## 5            509  37               900
## 6            471  32               150

Based on RQ1, the unit of observation is a transaction, and the variables are age and total expenditure per transaction.

H0: The population correlation coefficient equals 0.
H1: The population correlation coefficient does not equal 0.

summary(myrelevantdata)

##  Transaction_ID       Age        Total_Expenditure
##  Min.   :  1.0   Min.   :18.00   Min.   :  25.00  
##  1st Qu.:270.8   1st Qu.:29.00   1st Qu.:  71.25  
##  Median :510.5   Median :42.00   Median : 120.00  
##  Mean   :510.3   Mean   :41.31   Mean   : 445.43  
##  3rd Qu.:764.2   3rd Qu.:54.00   3rd Qu.: 900.00  
##  Max.   :998.0   Max.   :64.00   Max.   :2000.00

Statistical description:

The average age of the customers is 41.31 years, with the youngest customer being 18 and the oldest 64 years old. 25% of the customers were 29 years old or younger, half were 42 or younger, and 75% were 54 or younger (so 25% were at least 54 years old).
Total expenditure ranges from 25 to 2000 monetary units per transaction, with an average transaction value of 445.43 monetary units. 25% of transactions were worth 71.25 monetary units or less, and 75% were worth 900 monetary units or less. The median (120) is far below the mean, which indicates a right-skewed distribution.

RQ1: Is there a correlation between the customer’s age and their total spending per transaction?

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

scatterplotMatrix(myrelevantdata[ , -1], smooth=FALSE)

We need to check the assumptions:

Both variables are numeric. -> This assumption is met.
Both variables are normally distributed. With a sample this large (500 units) a formal test is usually not needed, but the histograms on the diagonal of the scatterplot matrix suggest that the variables are not normally distributed, so I will still perform the Shapiro-Wilk test.
Linear relationship between the variables. -> The assumption appears to be violated, because the points are scattered across the whole plot without a clear pattern. Let's check with a more detailed scatterplot:

library(car)
scatterplot(myrelevantdata$Age, myrelevantdata$Total_Expenditure,
            smooth = TRUE,
            boxplot = FALSE,
            main = "Relationship between the Age and Total expenditure",
            xlab = "Age",
            ylab = "Total Expenditure")

The scatterplot confirms this: there is no visible linear relationship between age and total expenditure, so the linearity assumption is not met.

shapiro.test(myrelevantdata$Age)

## 
##  Shapiro-Wilk normality test
## 
## data:  myrelevantdata$Age
## W = 0.94746, p-value = 2.523e-12

H0: Variable Age is normally distributed.
H1: Variable Age is not normally distributed.

We reject H0 (p < 0.001) and conclude that Age is not normally distributed.

shapiro.test(myrelevantdata$Total_Expenditure)

## 
##  Shapiro-Wilk normality test
## 
## data:  myrelevantdata$Total_Expenditure
## W = 0.73516, p-value < 2.2e-16

H0: Variable Total Expenditure is normally distributed.
H1: Variable Total Expenditure is not normally distributed.

We reject H0 (p < 0.001) and conclude that Total Expenditure is not normally distributed.

Neither variable is normally distributed, so I will use the Spearman rank correlation. The Pearson correlation is shown first for comparison.

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(myrelevantdata[ , -1]), 
      type = "pearson")

##                     Age Total_Expenditure
## Age                1.00             -0.07
## Total_Expenditure -0.07              1.00
## 
## n= 500 
## 
## 
## P
##                   Age   Total_Expenditure
## Age                     0.113            
## Total_Expenditure 0.113

The Pearson coefficient (r = -0.07, p = 0.113) is not appropriate here, because neither variable is normally distributed and the relationship is not linear, so we need to perform the Spearman test:

library(Hmisc)
rcorr(as.matrix(myrelevantdata[ , -1]), 
      type = "spearman")

##                     Age Total_Expenditure
## Age                1.00             -0.03
## Total_Expenditure -0.03              1.00
## 
## n= 500 
## 
## 
## P
##                   Age    Total_Expenditure
## Age                      0.5445           
## Total_Expenditure 0.5445

cor(myrelevantdata$Age, myrelevantdata$Total_Expenditure,
    method = "spearman",
    use = "complete.obs")

## [1] -0.02716458

cor.test(myrelevantdata$Age, myrelevantdata$Total_Expenditure,
    method = "spearman") # cor.test() has no 'use' argument; it drops incomplete pairs itself

## Warning in cor.test.default(myrelevantdata$Age,
## myrelevantdata$Total_Expenditure, : Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  myrelevantdata$Age and myrelevantdata$Total_Expenditure
## S = 21399177, p-value = 0.5445
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.02716458

H0: The population correlation coefficient between Age and Total Expenditure is 0.
H1: The population correlation coefficient between Age and Total Expenditure is not 0.

We cannot reject H0 (rho = -0.03, p = 0.54). We cannot conclude that Age and Total Expenditure are correlated.
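As a reminder of what the Spearman coefficient measures, it is the Pearson correlation computed on the ranks of the two variables. The sketch below uses a few illustrative values that are assumptions for demonstration only, not the sampled data, and checks that both routes give the same number.

```r
# Illustrative values only (not the sampled dataset)
age   <- c(22, 18, 21, 54, 37, 32, 45, 50)
spend <- c(50, 90, 600, 200, 900, 150, 30, 1000)

# Spearman's rho computed directly
rho_direct <- cor(age, spend, method = "spearman")

# The same coefficient obtained as Pearson's r on the ranks
rho_ranks <- cor(rank(age), rank(spend), method = "pearson")

all.equal(rho_direct, rho_ranks)
```

Because only ranks are used, the coefficient is not affected by the strong right skew of Total Expenditure, which is why it suits this data better than Pearson's r.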

RQ2: Is there an association between gender and the category of purchase?

Product Category is the category of the purchased product (e.g., Electronics, Clothing, Beauty)

mysecondrelevantdata <- sampled_data[, c("Transaction.ID", "Gender", "Product.Category")] # keeping only the variables relevant to RQ2
colnames(mysecondrelevantdata) <- c("Transaction_ID", "Gender", "Category") # renaming the columns
mysecondrelevantdata$Gender <- factor(mysecondrelevantdata$Gender, levels = c("Male", "Female")) # labels default to the levels
mysecondrelevantdata$Category <- factor(mysecondrelevantdata$Category, levels = c("Clothing", "Beauty", "Electronics"))
head(mysecondrelevantdata)

##   Transaction_ID Gender    Category
## 1            836 Female    Clothing
## 2            679 Female      Beauty
## 3            129 Female      Beauty
## 4            930   Male    Clothing
## 5            509 Female Electronics
## 6            471   Male    Clothing

summary(mysecondrelevantdata)

##  Transaction_ID     Gender           Category  
##  Min.   :  1.0   Male  :238   Clothing   :175  
##  1st Qu.:270.8   Female:262   Beauty     :162  
##  Median :510.5                Electronics:163  
##  Mean   :510.3                                 
##  3rd Qu.:764.2                                 
##  Max.   :998.0

results <- chisq.test(mysecondrelevantdata$Category, mysecondrelevantdata$Gender, 
                      correct = FALSE)
results

## 
##  Pearson's Chi-squared test
## 
## data:  mysecondrelevantdata$Category and mysecondrelevantdata$Gender
## X-squared = 2.0934, df = 2, p-value = 0.3511

H0: There is no association between gender and category of purchase.
H1: There is an association between gender and category of purchase.

We cannot reject H0 (p = 0.35). There is not enough evidence to conclude that gender and category of purchase are associated.

addmargins(results$observed)

##                              mysecondrelevantdata$Gender
## mysecondrelevantdata$Category Male Female Sum
##                   Clothing      91     84 175
##                   Beauty        73     89 162
##                   Electronics   74     89 163
##                   Sum          238    262 500

Here we can see the observed frequencies.

round(results$expected, 2)

##                              mysecondrelevantdata$Gender
## mysecondrelevantdata$Category  Male Female
##                   Clothing    83.30  91.70
##                   Beauty      77.11  84.89
##                   Electronics 77.59  85.41

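Here we can see the expected frequencies, rounded to two decimals. All of them are greater than 5, so the assumption of the chi-squared test about expected frequencies is met.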
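The expected frequencies and the test statistic can be reproduced by hand from the observed counts. The sketch below retypes the observed table shown above as a matrix (the object names are my own) and applies the textbook formulas: expected count = row total * column total / n, and the chi-squared statistic as the sum of (observed - expected)^2 / expected.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(91, 84,
                     73, 89,
                     74, 89),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Category = c("Clothing", "Beauty", "Electronics"),
                                   Gender   = c("Male", "Female")))

n <- sum(observed)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic, degrees of freedom, and p-value
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(expected, 2)
round(c(chi_sq = chi_sq, df = df, p_value = p_value), 4)
```

The values agree with the chisq.test() output reported earlier (X-squared = 2.0934, df = 2, p-value = 0.3511).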

round(results$residuals, 2)

##                              mysecondrelevantdata$Gender
## mysecondrelevantdata$Category  Male Female
##                   Clothing     0.84  -0.80
##                   Beauty      -0.47   0.45
##                   Electronics -0.41   0.39

The values shown are Pearson residuals; the standardized (adjusted) residuals are available as results$stdres. All residuals are below 1.96 in absolute value (alpha = 5%), so no cell deviates significantly from its expected frequency.
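The 1.96 threshold strictly applies to standardized residuals, which are somewhat larger in absolute value than the Pearson residuals printed above. A minimal sketch, again retyping the observed counts as a matrix with object names of my own choosing, computes both kinds and checks the result against the stdres component returned by chisq.test():

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(91, 84,
                     73, 89,
                     74, 89),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Category = c("Clothing", "Beauty", "Electronics"),
                                   Gender   = c("Male", "Female")))

n        <- sum(observed)
row_p    <- rowSums(observed) / n   # row proportions
col_p    <- colSums(observed) / n   # column proportions
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson residuals: (O - E) / sqrt(E)
pearson_res <- (observed - expected) / sqrt(expected)

# Standardized residuals: additionally divided by sqrt((1 - row prop.) * (1 - col prop.))
std_res <- (observed - expected) /
  sqrt(expected * outer(1 - row_p, 1 - col_p))

round(pearson_res, 2)
round(std_res, 2)

# Cross-check against the value computed by chisq.test()
all.equal(std_res, chisq.test(observed)$stdres, check.attributes = FALSE)
```

Even on the standardized scale every cell stays below 1.96 in absolute value, so the interpretation does not change.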

addmargins(round(prop.table(results$observed), 3))

##                              mysecondrelevantdata$Gender
## mysecondrelevantdata$Category  Male Female   Sum
##                   Clothing    0.182  0.168 0.350
##                   Beauty      0.146  0.178 0.324
##                   Electronics 0.148  0.178 0.326
##                   Sum         0.476  0.524 1.000

Explanation of a random value: Out of all 500 transactions, 17.8% were made by female customers who bought “Beauty” category products.

addmargins(round(prop.table(results$observed, 1), 3), 2)

##                              mysecondrelevantdata$Gender
## mysecondrelevantdata$Category  Male Female   Sum
##                   Clothing    0.520  0.480 1.000
##                   Beauty      0.451  0.549 1.000
##                   Electronics 0.454  0.546 1.000

Explanation of a random value: Out of all customers who bought “Beauty” category products, 54.9% were female.

addmargins(round(prop.table(results$observed, 2), 3), 1)

##                              mysecondrelevantdata$Gender
## mysecondrelevantdata$Category  Male Female
##                   Clothing    0.382  0.321
##                   Beauty      0.307  0.340
##                   Electronics 0.311  0.340
##                   Sum         1.000  1.001

Explanation of a random value: Out of all female customers, 34% bought “Beauty” category products.
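The three proportion tables differ only in the denominator. A short sketch, with the observed counts retyped as a table and object names that are my own, shows the three readings of the Beauty/Female cell. It also rounds after adding the margins rather than before, which avoids column sums such as the 1.001 seen above.

```r
# Observed counts copied from the contingency table above
observed <- as.table(matrix(c(91, 84,
                              73, 89,
                              74, 89),
                            nrow = 3, byrow = TRUE,
                            dimnames = list(Category = c("Clothing", "Beauty", "Electronics"),
                                            Gender   = c("Male", "Female"))))

overall <- prop.table(observed)      # denominator: all 500 transactions
by_row  <- prop.table(observed, 1)   # denominator: category (row) total
by_col  <- prop.table(observed, 2)   # denominator: gender (column) total

# Three readings of the same cell
round(c(overall = overall["Beauty", "Female"],
        by_row  = by_row["Beauty", "Female"],
        by_col  = by_col["Beauty", "Female"]), 3)

# Rounding after adding the margins keeps each column sum at exactly 1
round(addmargins(by_col, 1), 3)
```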

library(effectsize)
effectsize::cramers_v(mysecondrelevantdata$Category, mysecondrelevantdata$Gender)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.01              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.01)

## [1] "tiny"
## (Rules: funder2019)

We cannot conclude that gender and category of purchase are associated: the chi-squared test is not significant, and the effect size is tiny (Cramér's V = 0.01).
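For orientation, Cramér's V can be computed by hand from the chi-squared statistic. The sketch below uses the reported X-squared = 2.0934 and n = 500. The "adjusted" value is computed with a commonly used bias-corrected formula; that this is the formula behind the "(adj.)" label in the effectsize output is my assumption, not something the output itself confirms.

```r
chi_sq <- 2.0934   # reported Pearson chi-squared statistic
n      <- 500      # sample size
r      <- 3        # number of categories (rows)
k      <- 2        # number of genders (columns)

# Unadjusted Cramer's V
v_plain <- sqrt(chi_sq / (n * min(r - 1, k - 1)))

# Bias-corrected version (assumed to correspond to the "adj." output)
phi2_adj <- max(0, chi_sq / n - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

round(c(unadjusted = v_plain, adjusted = v_adj), 2)
```

Under this assumption the adjusted value rounds to 0.01, matching the output above, while the unadjusted value is about 0.06. Both point to a very weak association.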


2024-01-18
