Let’s import and display our dataset.
customers <- read.csv("Mall_Customers.csv", header = TRUE)
head(customers)
We are using read.csv() to load our dataset,
Mall_Customers.csv.
Our dataset contains 200 observations of 5 variables. It provides some basic data about the shop customers like Customer ID, age, gender, annual income and spending score.
The unit of observation in our case is one individual customer.
The sample size is 200.
Variables and definitions:
| Variable name | Type | Description |
|---|---|---|
| Customer ID | Categorical (Nominal) | Unique identifier for each customer |
| Gender | Categorical (Nominal) | Customer’s gender |
| Age | Numerical (Ratio) | Customer’s age in years |
| Annual income | Numerical (Ratio) | Customer’s annual income in thousands of USD |
| Spending score | Numerical (Interval) | Score based on customer spending habits (1–100 scale) |
The original column names are somewhat awkward (e.g.,
“Annual.Income..k..”), so we use
names(customers)[...] <- ... to give them more readable
titles.
names(customers)[names(customers) == "CustomerID"] <- "Customer ID"
names(customers)[names(customers) == "Annual.Income..k.."] <- "Annual Income (USD in thousands)"
names(customers)[names(customers) == "Spending.Score..1.100."] <- "Spending Score (1-100)"
colnames(customers)
## [1] "Customer ID" "Gender"
## [3] "Age" "Annual Income (USD in thousands)"
## [5] "Spending Score (1-100)"
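The new names contain spaces and parentheses, so they are non-syntactic in R and have to be referenced with backticks or as strings. The following is a minimal sketch on a made-up two-row data frame (`toy` and its values are hypothetical, not the mall data):

```r
# Hypothetical toy data frame, not the real mall data
toy <- data.frame(CustomerID = 1:2, Annual.Income..k.. = c(15, 16))

# Rename with the same pattern used above
names(toy)[names(toy) == "Annual.Income..k.."] <- "Annual Income (USD in thousands)"

# Non-syntactic names need backticks with $ ...
income_dollar <- toy$`Annual Income (USD in thousands)`

# ... or can be passed as a plain string to [[ ]]
income_bracket <- toy[["Annual Income (USD in thousands)"]]

# Both access styles return the same column
identical(income_dollar, income_bracket)
```

This is why later calls such as tapply() and t.test() wrap the column names in backticks.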
Let’s examine the structure of our dataset with
str() that shows how each column is stored
(numeric, factor, character etc.).
str(customers)
## 'data.frame': 200 obs. of 5 variables:
## $ Customer ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr "Male" "Male" "Female" "Female" ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual Income (USD in thousands): int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending Score (1-100) : int 39 81 6 77 40 76 6 94 3 72 ...
Let’s define Gender as a factor with two levels, “Male”
and “Female” to help R understand that “Gender” is
categorical, useful for summary statistics and
plotting.
customers$Gender <- factor(customers$Gender, levels = c("Male", "Female"))
str(customers)
## 'data.frame': 200 obs. of 5 variables:
## $ Customer ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 2 levels "Male","Female": 1 1 2 2 2 2 2 2 1 2 ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual Income (USD in thousands): int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending Score (1-100) : int 39 81 6 77 40 76 6 94 3 72 ...
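The explicit levels argument also fixes the order of the groups, which matters later because functions such as t.test() report differences as first level minus second level. A small sketch with a made-up vector (`gender_chr` is hypothetical) shows the difference from the default ordering:

```r
# Hypothetical toy vector, not the real data
gender_chr <- c("Male", "Female", "Female", "Male")

# Default: levels are sorted alphabetically, so "Female" comes first
default_levels <- levels(factor(gender_chr))

# Explicit levels: "Male" becomes level 1, as in the call above
gender <- factor(gender_chr, levels = c("Male", "Female"))

default_levels
levels(gender)
as.integer(gender)  # internal integer codes follow the level order
```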
sum(is.na(customers))
## [1] 0
We see that there are no NAs in our dataset, but for the sake of practice let’s run the cleaning step anyway.
cat("Number of rows before removing NAs:", nrow(customers), "\n")
## Number of rows before removing NAs: 200
customers <- na.omit(customers)
cat("Number of rows after removing NAs:", nrow(customers), "\n")
## Number of rows after removing NAs: 200
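Since the real data has no missing values, the cleaning step changes nothing here. To see what it would do, here is a minimal sketch on a made-up data frame (`toy` is hypothetical) that counts NAs per column and then drops incomplete rows:

```r
# Hypothetical toy data with two missing values
toy <- data.frame(Age = c(19, NA, 20), Score = c(39, 81, NA))

# Count missing values column by column rather than in total
na_per_column <- colSums(is.na(toy))

# na.omit() drops every row that has at least one NA
toy_clean <- na.omit(toy)

na_per_column
nrow(toy_clean)
```

Counting per column, rather than using a single sum(is.na()), shows which variable is responsible before any rows are removed.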
We load the psych package to use its
describe() function, which returns statistics such as the mean, SD,
median, min, and max.
library(psych)
describe(customers[ , -c(1,2)])
The average age of customers is 38.9 years old.
Mean annual income is 60.6 thousand USD.
Judging by the skew statistic, the distributions of both
age and annual income are skewed to the right.
The median income is 61.5 thousand USD, meaning that half of the customers earn 61.5k USD or less and half earn more.
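To make the skew reading concrete, here is a rough sketch of a moment-based skewness on made-up vectors. The helper `sample_skew()` is hypothetical, and psych::describe() may use a slightly different normalisation, so the exact values need not match its output:

```r
# Hypothetical helper: third central moment scaled by the
# second central moment raised to the power 1.5
sample_skew <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up examples: a long right tail vs. a symmetric sample
right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 20)
symmetric <- c(1, 2, 3, 4, 5)

sample_skew(right_tailed)  # positive: tail extends to the right
sample_skew(symmetric)     # zero: no asymmetry
```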
library(pastecs)
round(stat.desc(customers[ , -c(1,2)]), 2)
Let’s use stat.desc() from pastecs, which
also offers a wide range of descriptive stats.
summary(customers[ , -1])
## Gender Age Annual Income (USD in thousands)
## Male : 88 Min. :18.00 Min. : 15.00
## Female:112 1st Qu.:28.75 1st Qu.: 41.50
## Median :36.00 Median : 61.50
## Mean :38.85 Mean : 60.56
## 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :70.00 Max. :137.00
## Spending Score (1-100)
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
We could also use the base R function summary(), which
gives the min, 1st quartile, median, mean, 3rd quartile, and max for each
numeric variable. We keep Gender to see how many rows fall into
each factor level (88 males and 112 females).
tapply(customers$`Annual Income (USD in thousands)`, customers$Gender, summary)
## $Male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 45.50 62.50 62.23 78.00 137.00
##
## $Female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 39.75 60.00 59.25 77.25 126.00
Here, we compute summary statistics for annual income grouped
by Gender.
This lets us compare incomes between males and females quickly. On average, male customers have a slightly higher annual income than female customers (62.23 vs. 59.25 thousand USD).
Let’s create a new categorical variable called
IncomeCategory based on annual income.
library(dplyr)
customers <- customers %>%
mutate(IncomeCategory = ifelse(`Annual Income (USD in thousands)` < 30, "Low",
ifelse(`Annual Income (USD in thousands)` <= 90, "Medium", "High")))
customers$IncomeCategory <- as.factor(customers$IncomeCategory)
table(customers$IncomeCategory)
##
## High Low Medium
## 22 30 148
prop.table(table(customers$IncomeCategory))
##
## High Low Medium
## 0.11 0.15 0.74
Here, we categorize incomes as “Low” (below $30k), “Medium” ($30k to $90k, inclusive), and “High” (above $90k).
We can see that most of the customers (74%) fall into the “Medium” income category.
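Note that as.factor() sorts the levels alphabetically, which is why the tables list High before Low. One possible variant, sketched here on made-up incomes (`income` is hypothetical), keeps the same thresholds but sets the level order explicitly:

```r
# Hypothetical incomes in thousands of USD, chosen to hit the boundaries
income <- c(15, 29, 30, 61, 90, 91, 137)

# Same thresholds as above: < 30 is Low, 30 to 90 is Medium, > 90 is High
category <- ifelse(income < 30, "Low",
                   ifelse(income <= 90, "Medium", "High"))

# Explicit, ordered levels so tables and plots read Low -> Medium -> High
category <- factor(category, levels = c("Low", "Medium", "High"), ordered = TRUE)

table(category)
```

Whether an ordered factor is preferable depends on how the variable is used later, so treat this as an option rather than a required change.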
The management of the city central mall has implemented a new metric, “Spending Score” (ranging from 1 to 100), that summarizes a customer’s purchasing frequency, average basket size, etc. After collecting data on 200 shoppers, management wants to see if resources should be allocated differently to marketing campaigns targeting male vs. female shoppers.
Thus, the key research question is:
“Do male and female customers at city central mall differ in their average spending score?”
To address the research question, we set up a formal hypothesis test comparing two population means (independent samples):
Null Hypothesis (\(H_0\)):
\(\mu_\text{Male} =
\mu_\text{Female}\)
In other words, there is no difference in the mean
Spending Score between male and female customers.
Alternative Hypothesis (\(H_1\)):
\(\mu_\text{Male} \ne
\mu_\text{Female}\)
There IS a difference in the mean Spending Score
between male and female customers.
This is a two-sided test.
To use a two-sample independent t-test, we generally assume:
1. The variable is numeric.
2. The distribution of the variable is normal in both populations.
3. The samples are independent.
4. The variable has the same variance in both populations; if not, we apply the Welch correction.
Assumptions 1 and 3 are true in our scenario. Let’s check the other two.
As discussed in the lectures, it is not advised to conduct the Shapiro-Wilk Test on big samples, since it is extremely sensitive to small deviations. However, for the sake of practice, let’s conduct one.
by(customers$`Spending Score (1-100)`, customers$Gender, shapiro.test)
## customers$Gender: Male
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.95218, p-value = 0.002627
##
## ------------------------------------------------------------
## customers$Gender: Female
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.97438, p-value = 0.02977
The null hypothesis is that the sample comes from a normally distributed population. The alternative hypothesis is that it DOES NOT come from a normally distributed population.
Here, both p-values are below 0.05, so formally we would reject the null hypothesis for both groups. However, with samples of this size (88 and 112 observations) the test flags even minor deviations from normality, so we do not rely on it alone.
Let’s visualize the distributions instead.
library(ggplot2)
library(ggpubr)
hist_male <- ggplot(
data = subset(customers, Gender == "Male"),
aes(x = `Spending Score (1-100)`)
) +
geom_histogram(
bins = 5,
alpha = 0.7,
color = "black",
fill = "blue"
) +
theme_minimal() +
labs(
title = "Male",
x = "Spending Score (1-100)",
y = "Frequency"
)
hist_female <- ggplot(
data = subset(customers, Gender == "Female"),
aes(x = `Spending Score (1-100)`)
) +
geom_histogram(
bins = 5,
alpha = 0.7,
color = "black",
fill = "red"
) +
theme_minimal() +
labs(
title = "Female",
x = "Spending Score (1-100)",
y = "Frequency"
)
ggarrange(hist_male, hist_female, ncol = 2, nrow = 1)
library(ggpubr)
library(ggplot2)
ggqqplot(
data = customers,
x = "`Spending Score (1-100)`",
color = "Gender",
fill = "Gender",
facet.by = "Gender",
legend = "none",
ggtheme = theme_minimal()
) +
scale_color_manual(values = c("Male" = "blue", "Female" = "red")) +
scale_fill_manual(values = c("Male" = "blue", "Female" = "red")) +
labs(
title = "Q-Q plots for Spending Score by Gender",
x = "Theoretical",
y = "Sample"
)
library(psych)
describeBy(customers$`Spending Score (1-100)`, customers$Gender)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 88 48.51 27.9 50 48.46 34.1 1 97 96 -0.06 -1.04 2.97
## ------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 112 51.53 24.11 50 51.61 27.43 5 99 94 0.03 -0.81 2.28
Based on the visualizations, especially the Q–Q plots, we conclude that the variable is approximately normally distributed in both groups.
library(car)
leveneTest(`Spending Score (1-100)` ~ Gender, data = customers)
In Levene’s Test for Homogeneity of Variance, the null hypothesis is that the variance of Spending Score is the same for male and female customers. The alternative hypothesis is that the two variances are NOT equal.
The p-value is > 0.05, so we fail to reject the null hypothesis. We can assume equal variances.
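To show what the test is doing, here is a base R sketch of the median-centred version of Levene’s test, which, as far as I know, is the default in car::leveneTest(). It runs a one-way ANOVA on each observation’s absolute deviation from its own group median. The data (`score`, `group`) are made up for illustration:

```r
# Hypothetical toy data: group B is visibly more spread out than group A
score <- c(10, 12, 14, 16, 18, 5, 15, 25, 35, 45)
group <- factor(rep(c("A", "B"), each = 5))

# Absolute deviation of each value from its own group median
abs_dev <- abs(score - ave(score, group, FUN = median))

# One-way ANOVA on those deviations; the F test for 'group' is the Levene test
levene_fit <- anova(lm(abs_dev ~ group))
levene_p <- levene_fit[["Pr(>F)"]][1]

tapply(abs_dev, group, mean)  # average spread per group
levene_p
```

A small p-value here would mean the groups differ in average spread, which is the situation where the Welch correction becomes relevant.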
t.test(
`Spending Score (1-100)` ~ Gender,
data = customers,
var.equal = TRUE,
alternative = "two.sided"
)
##
## Two Sample t-test
##
## data: Spending Score (1-100) by Gender
## t = -0.81905, df = 198, p-value = 0.4137
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
## -10.275652 4.244808
## sample estimates:
## mean in group Male mean in group Female
## 48.51136 51.52679
library(effectsize)
effectsize::cohens_d(customers$`Spending Score (1-100)` ~ customers$Gender,
pooled_sd = FALSE)
interpret_cohens_d(0.12, rules = "sawilowsky2009")
## [1] "very small"
## (Rules: sawilowsky2009)
The p-value (0.4137) is far greater than 0.05, so we fail to reject \(H_0\): the data provide no evidence of a difference in mean Spending Score between the two groups.
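As a sanity check, the test statistic and effect size can be approximately reproduced from the group summaries reported above (n, mean, and SD per group). The sketch below uses the pooled SD, which matches var.equal = TRUE. The cohens_d() call above uses pooled_sd = FALSE, which should give nearly the same value here because the two SDs are similar. The SDs are the rounded values from describeBy(), so small rounding differences are expected:

```r
# Summary values copied from the t.test() and describeBy() outputs above
n_m <- 88;  mean_m <- 48.51136; sd_m <- 27.90
n_f <- 112; mean_f <- 51.52679; sd_f <- 24.11

# Pooled standard deviation under the equal-variance assumption
pooled_sd <- sqrt(((n_m - 1) * sd_m^2 + (n_f - 1) * sd_f^2) / (n_m + n_f - 2))

# Two-sample t statistic and two-sided p-value with n_m + n_f - 2 df
t_stat  <- (mean_m - mean_f) / (pooled_sd * sqrt(1 / n_m + 1 / n_f))
p_value <- 2 * pt(-abs(t_stat), df = n_m + n_f - 2)

# Cohen's d: mean difference in units of the pooled SD
cohens_d <- (mean_m - mean_f) / pooled_sd

round(c(t = t_stat, p = p_value, d = cohens_d), 2)
```

The negative signs only reflect the Male minus Female direction. In magnitude, d is about 0.12, the value passed to interpret_cohens_d().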
For the sake of practice, let’s suppose that normality is violated and perform the nonparametric alternative, the Wilcoxon Rank Sum Test.
In this case, the hypotheses formulation would be a bit different:
Null Hypothesis (\(H_0\)):
The distribution location of Spending Score is the same for Male and
Female.
Alternative Hypothesis (\(H_1\)):
The distribution location of Spending Score is NOT the same for Male and
Female.
wilcox.test(customers$`Spending Score (1-100)` ~ customers$Gender,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: customers$`Spending Score (1-100)` by customers$Gender
## W = 4697.5, p-value = 0.5704
## alternative hypothesis: true location shift is not equal to 0
library(effectsize)
effectsize(wilcox.test(customers$`Spending Score (1-100)` ~ customers$Gender,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
interpret_rank_biserial(0.05)
## [1] "very small"
## (Rules: funder2019)
The p-value (0.5704) is greater than 0.05, so we fail to reject the null hypothesis, and the estimated difference in distribution locations is very small.
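The rank-biserial correlation can be recovered by hand from the W statistic using the usual relationship r = 2W / (n1 · n2) − 1. This is a sketch under the assumption that W counts the pairs favouring the first group (Male); the sign reported by effectsize depends on group order:

```r
# Values copied from the wilcox.test() output and group sizes above
W   <- 4697.5
n_m <- 88
n_f <- 112

# Share of Male-Female pairs favouring Male, rescaled from [0, 1] to [-1, 1]
rank_biserial <- 2 * W / (n_m * n_f) - 1

round(rank_biserial, 2)
```

In magnitude this is about 0.05, the value passed to interpret_rank_biserial().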
Given our results, the parametric (two-sample) t-test is appropriate in this scenario, because our sample is reasonably large (n = 200) and none of the assumptions were violated: Spending Score is a numeric variable, its distribution is approximately normal in both groups, the samples are independent, and the variances for male and female customers are equal.
The t-test yielded a p-value much greater than 0.05, leading us to fail to reject the null hypothesis (\(\mu_\text{Male} = \mu_\text{Female}\)). Based on our sample data, we find no evidence that men and women differ in their average Spending Score. In other words, there is no statistically significant difference in average Spending Score between male and female customers. The effect size (Cohen’s d) was also very small (~0.12).
From a business standpoint, there appears to be no justification for changing marketing resource allocation strictly on the basis of customer gender.