Venetia Polyzou

mydata <- read.table("./Shopping Mall Customer Segmentation Data2 .csv",
                     header = TRUE,
                     sep = ",",
                     dec = ",")
head(mydata)
##                            Customer.ID Age Gender Annual.Income Spending.Score
## 1 d410ea53-6661-42a9-ad3a-f554b05fd2a7  30   Male        151479             89
## 2 1770b26f-493f-46b6-837f-4237fb5a314e  58 Female        185088             95
## 3 e81aa8eb-1767-4b77-87ce-1620dc732c5e  62 Female         70912             76
## 4 9795712a-ad19-47bf-8886-4f997d6046e3  23   Male         55460             57
## 5 64139426-2226-4cd6-bf09-91bce4b4db5e  24   Male        153752             76
## 6 7e211337-e92f-4140-8231-5c9ac7a2aa12  42   Male        158335             40

General:

Variables:

  1. ID:
  2. Age:
  3. Gender:
  4. Annual Income:
  5. Spending Score:

Source of the data: kaggle.com (https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data)

mydata$GenderF <- factor(mydata$Gender, 
                         levels = c("Male", "Female"),
                         labels = c("Male", "Female"))
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
mydata <- mydata %>%
  rename(annual.income = Annual.Income) %>%
  drop_na()
mydataF <- mydata[mydata$annual.income > 150000 , ]
library(dplyr)
mydata2F <- mydata %>%
  filter(Gender == "Male" )
mydata3 <- mydata[mydata$Spending.Score >= 60 & mydata$Spending.Score <= 80 , ]
summary(mydata[ ,c(-1,-3,-6)])
##       Age        annual.income    Spending.Score  
##  Min.   :18.00   Min.   : 22655   Min.   :  1.00  
##  1st Qu.:32.50   1st Qu.: 69202   1st Qu.: 27.00  
##  Median :48.00   Median :111526   Median : 45.00  
##  Mean   :49.57   Mean   :112493   Mean   : 48.38  
##  3rd Qu.:65.00   3rd Qu.:157317   3rd Qu.: 73.50  
##  Max.   :90.00   Max.   :199879   Max.   :100.00

The summary above gives the descriptive statistics for the numeric variables (Age, annual.income, Spending.Score). The ID and the categorical Gender variables (Gender and GenderF) are excluded.

library(psych)
describeBy(mydata$Spending.Score, g = mydata$GenderF)
## 
##  Descriptive statistics by group 
## group: Male
##    vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 93 46.85 28.85     42   46.03 35.58   1 100    99 0.23    -1.09 2.99
## ------------------------------------------------------------ 
## group: Female
##    vars   n  mean   sd median trimmed   mad min max range skew kurtosis   se
## X1    1 106 49.72 28.3   50.5   49.78 34.84   1  98    97 0.02    -1.16 2.75

Research question:

We will use the independent samples t-test because we have one numeric variable (Spending.Score) and one categorical variable (factor) with 2 independent groups (Gender: Male/Female).

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata, aes(x = Spending.Score)) +
  geom_histogram(binwidth = 30, colour = "pink") +
  facet_wrap(~GenderF, ncol = 1) +
  ylab("Frequency")

According to the histograms above, the distribution of the spending score does not appear to be normal for either males or females.

To be sure, we run two Shapiro-Wilk tests, one for the normality of the spending score of males and one for the normality of the spending score of females.

##install.packages("rstatix")
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
library(dplyr)
mydata %>%
  group_by(GenderF) %>%
  shapiro_test(Spending.Score)
## # A tibble: 2 × 4
##   GenderF variable       statistic       p
##   <fct>   <chr>              <dbl>   <dbl>
## 1 Male    Spending.Score     0.953 0.00210
## 2 Female  Spending.Score     0.957 0.00177

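For each of the 2 Shapiro-Wilk tests above, the hypotheses are:

H0: the spending score is normally distributed.

H1: the spending score is not normally distributed.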

For males:

The null hypothesis (H0) is rejected at p = 0.002, so we assume that the distribution is not normal.

For females:

The null hypothesis (H0) is rejected at p = 0.002, so we assume that the distribution is not normal.

So, the assumption of normality is violated.
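The per-group normality check can also be sketched in base R, without rstatix. The example below uses simulated scores rather than the mall data, so the object names (`toy`, `score`, `gender`) and the 0.05 threshold are assumptions made only for illustration.

```r
# Simulated stand-in data (NOT the mall data): 93 "Male" and 106 "Female" scores
set.seed(1)
toy <- data.frame(
  score  = c(runif(93, 1, 100), runif(106, 1, 100)),
  gender = factor(rep(c("Male", "Female"), times = c(93, 106)),
                  levels = c("Male", "Female"))
)

# Run one Shapiro-Wilk test per group
sw <- lapply(split(toy$score, toy$gender), shapiro.test)

# Collect the p-values and compare them with the chosen significance level
p_values <- sapply(sw, function(res) res$p.value)
decision <- ifelse(p_values < 0.05,
                   "reject H0 (not normal)",
                   "cannot reject H0")

print(data.frame(p_value = round(p_values, 4), decision = decision))
```

Splitting the score by group first mirrors what `group_by()` followed by `shapiro_test()` does: each group gets its own test and its own p-value.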

##install.packages("ggpubr")
library(ggpubr)
ggqqplot(mydata,
         "Spending.Score",
         facet.by = "GenderF")

According to the 2 quantile-quantile (Q-Q) plots:

So, according to the three checks above (histograms, Shapiro-Wilk tests, Q-Q plots), we conclude that normality is violated for both groups, which is why we need to use the non-parametric Wilcoxon rank-sum test.

##install.packages("car")
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
leveneTest(mydata$Spending.Score, group = mydata$GenderF)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.0027 0.9588
##       197

We also use Levene’s test to check the other assumption, the homogeneity of variances of the spending score across the two groups.

For Levene’s test the hypotheses are:

H0: the variances of the spending score for males and females are equal.

H1: the variances of the spending score for males and females are not equal.

We can’t (don’t have enough evidence to) reject the null hypothesis (H0) (F = 0.0027, p = 0.959), meaning that we will assume that the variances of the spending score for males and females are the same. As a result, we do not need to apply the Welch correction.
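To make the logic of the test more concrete: with `center = median`, Levene's test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below shows this in base R on simulated scores (not the mall data); all object names are illustrative assumptions.

```r
# Simulated stand-in data (NOT the mall data)
set.seed(42)
score  <- c(rnorm(93, mean = 47, sd = 29), rnorm(106, mean = 50, sd = 28))
gender <- factor(rep(c("Male", "Female"), times = c(93, 106)),
                 levels = c("Male", "Female"))

# Step 1: absolute deviation of every score from its own group median
group_median <- ave(score, gender, FUN = median)
abs_dev      <- abs(score - group_median)

# Step 2: one-way ANOVA on those deviations; its F test is the Levene statistic
levene_fit <- anova(lm(abs_dev ~ gender))

F_value <- levene_fit[["F value"]][1]
p_value <- levene_fit[["Pr(>F)"]][1]
dfs     <- levene_fit[["Df"]]          # between-group and residual df

print(levene_fit)
```

If the groups had very different spreads, the mean absolute deviation would differ between them and the F value would be large. An F value close to zero, as in the output above, indicates similar spreads.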

t.test(mydata$Spending.Score ~ mydata$GenderF,
       var.equal = TRUE,
       alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  mydata$Spending.Score by mydata$GenderF
## t = -0.70671, df = 197, p-value = 0.4806
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -10.869360   5.134322
## sample estimates:
##   mean in group Male mean in group Female 
##             46.84946             49.71698

Because the assumption of normality is not met, the non-parametric Wilcoxon rank-sum test should be used. However, only for the needs of this home assignment, we will also run the parametric independent samples t-test.

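Independent samples t-test hypotheses:

H0: the mean spending score of males is equal to the mean spending score of females.

H1: the mean spending score of males is not equal to the mean spending score of females.

We can’t (don’t have enough evidence to) reject the null hypothesis (H0) (t = -0.707, df = 197, p = 0.481), so we can’t say that the mean of the spending score of males is different from the mean of the spending score of females.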
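As a rough cross-check, the pooled-variance t statistic can be reproduced from the group summaries reported earlier (means from the `t.test` output, standard deviations and group sizes from `describeBy`). This is only a sketch: the standard deviations are rounded to two decimals, so the result agrees with the `t.test` output only approximately, and the variable names are illustrative.

```r
# Group summaries copied from the describeBy() and t.test() output above
n_m <- 93;  mean_m <- 46.84946; sd_m <- 28.85   # Male
n_f <- 106; mean_f <- 49.71698; sd_f <- 28.30   # Female

# Pooled standard deviation (equal variances assumed, as with var.equal = TRUE)
df <- n_m + n_f - 2
sd_pooled <- sqrt(((n_m - 1) * sd_m^2 + (n_f - 1) * sd_f^2) / df)

# Standard error of the mean difference, t statistic, two-sided p-value
se_diff <- sd_pooled * sqrt(1 / n_m + 1 / n_f)
t_stat  <- (mean_m - mean_f) / se_diff
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for the difference in means (Male minus Female)
ci <- (mean_m - mean_f) + c(-1, 1) * qt(0.975, df) * se_diff

print(round(c(t = t_stat, df = df, p = p_value), 4))
print(round(ci, 2))
```

The mean difference of about -2.87 points is small relative to its standard error of roughly 4 points, which is why the confidence interval comfortably includes zero.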

##install.packages("effectsize")
library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cohens_d(mydata$Spending.Score ~ mydata$GenderF,
                     pooled_sd = FALSE)
## Cohen's d |        95% CI
## -------------------------
## -0.10     | [-0.38, 0.18]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(0.10, rules = "sawilowsky2009")
## [1] "very small"
## (Rules: sawilowsky2009)
wilcox.test(mydata$Spending.Score ~ mydata$GenderF,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$Spending.Score by mydata$GenderF
## W = 4644, p-value = 0.4819
## alternative hypothesis: true location shift is not equal to 0

However, because the assumption of normality is violated, we should rely on the non-parametric Wilcoxon rank-sum test.

Wilcoxon rank-sum test:

We can’t (don’t have enough evidence to) reject the null hypothesis (H0) (W = 4644, p = 0.482). So, we can’t say that the locations of the spending score distributions for males and females are different.
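The W statistic itself is easy to compute by hand: rank all scores together, sum the ranks of the first group, and subtract the smallest possible rank sum for that group. The sketch below checks this against `wilcox.test` on simulated scores (not the mall data); `x` and `y` are illustrative stand-ins for the male and female scores.

```r
# Simulated stand-in data (NOT the mall data)
set.seed(123)
x <- sample(1:100, 93, replace = TRUE)    # stand-in for the male scores
y <- sample(1:100, 106, replace = TRUE)   # stand-in for the female scores

# Rank all observations jointly (ties receive the average rank)
ranks <- rank(c(x, y))

# W = rank sum of the first group minus its minimum possible value n1 * (n1 + 1) / 2
n1 <- length(x)
W_manual <- sum(ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2

# The same statistic as reported by wilcox.test()
W_builtin <- unname(wilcox.test(x, y, correct = FALSE, exact = FALSE)$statistic)

print(c(manual = W_manual, builtin = W_builtin))
```

Because the test only uses ranks, it does not require the scores to be normally distributed, which is what makes it suitable here.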

effectsize(wilcox.test(mydata$Spending.Score ~ mydata$GenderF,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) |        95% CI
## ---------------------------------
## -0.06             | [-0.22, 0.10]
interpret_rank_biserial(0.06)
## [1] "very small"
## (Rules: funder2019)
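Both effect sizes can be approximately reproduced from numbers already shown in the output. The sketch below assumes that the un-pooled Cohen's d divides the mean difference by the square root of the average of the two group variances, and that the rank-biserial correlation follows from W as 2W / (n1 * n2) - 1. The standard deviations are rounded, so agreement is only approximate.

```r
# Group summaries copied from the describeBy() and t.test() output above
n_m <- 93;  mean_m <- 46.84946; sd_m <- 28.85   # Male
n_f <- 106; mean_f <- 49.71698; sd_f <- 28.30   # Female

# Cohen's d with un-pooled SD (assumed form: root of the mean of the two variances)
d_unpooled <- (mean_m - mean_f) / sqrt((sd_m^2 + sd_f^2) / 2)

# Rank-biserial correlation from the Wilcoxon statistic W reported above
W    <- 4644
r_rb <- 2 * W / (n_m * n_f) - 1

print(round(c(cohens_d = d_unpooled, rank_biserial = r_rb), 2))
```

Both values are negative because Male is the first factor level and has the slightly lower scores. The sign reflects only the ordering of the groups, so the interpretation functions are applied to the absolute values.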

Summing up: Research question:

We will use the independent samples t-test because we have one numeric variable (Spending.Score) and one categorical variable (factor) with 2 independent groups (Gender: Male/Female). We also need to check the assumption of normality of the spending score within each group (both males and females) in order to know whether to use the parametric or the non-parametric test. Lastly, we need to check the homogeneity of variances to know whether the Welch correction is needed.

So: