Home Assignment 2

Importing and cleaning the data

Let’s import and display our dataset.

customers <- read.csv("Mall_Customers.csv", header = TRUE)
head(customers)

We are using read.csv() to load our dataset, Mall_Customers.csv.

Our dataset contains 200 observations of 5 variables. It provides basic information about the mall’s customers: customer ID, gender, age, annual income, and spending score.
The unit of observation in our case is one individual customer.
The sample size is 200.
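The counts above can be read directly off the data frame. A minimal sketch, assuming a small stand-in data frame (`toy` is hypothetical, so that the snippet runs without the CSV file); with the real data, `customers` would take its place:

```r
# Hypothetical stand-in for the real data (a few rows only)
toy <- data.frame(
  CustomerID = 1:4,
  Gender     = c("Male", "Male", "Female", "Female"),
  Age        = c(19, 21, 20, 23)
)

dim(toy)    # number of rows (observations) and columns (variables)
nrow(toy)   # sample size
names(toy)  # variable names as stored in the data frame
```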

Variables and definitions:

Variable name	Type	Description
Customer ID	Categorical (Nominal)	Unique identifier for each customer
Gender	Categorical (Nominal)	Customer’s gender
Age	Numerical (Ratio)	Customer’s age in years
Annual income	Numerical (Ratio)	Customer’s annual income in thousands of USD
Spending score	Numerical (Interval)	Score based on customer spending habits (1–100 scale)

Source of the data

The data was obtained from Kaggle.

Link: Mall Customer Segmentation Data

Data manipulation I

Renaming columns

The original column names are somewhat awkward (e.g., “Annual.Income..k..”), therefore we use names(customers)[...] <- ... to give them more readable titles.

names(customers)[names(customers) == "CustomerID"] <- "Customer ID"
names(customers)[names(customers) == "Annual.Income..k.."] <- "Annual Income (USD in thousands)"
names(customers)[names(customers) == "Spending.Score..1.100."] <- "Spending Score (1-100)"
colnames(customers)

## [1] "Customer ID"                      "Gender"                          
## [3] "Age"                              "Annual Income (USD in thousands)"
## [5] "Spending Score (1-100)"
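One consequence of the new names is that they are no longer syntactic R names, so they have to be quoted when used. A minimal sketch on a hypothetical two-row data frame (`toy` is an assumption for illustration):

```r
# Hypothetical data frame with the original, syntactic column name
toy <- data.frame(Annual.Income..k.. = c(15, 16))

# Rename to a label containing spaces and parentheses
names(toy)[names(toy) == "Annual.Income..k.."] <- "Annual Income (USD in thousands)"

# With `$`, a non-syntactic name must be wrapped in backticks ...
toy$`Annual Income (USD in thousands)`

# ... while `[[ ]]` takes the name as an ordinary string
toy[["Annual Income (USD in thousands)"]]
```

This is why the later code chunks refer to the columns with backticks.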

Examining the structure

Let’s examine the structure of our dataset with str(), which shows how each column is stored (numeric, factor, character, etc.).

str(customers)

## 'data.frame':    200 obs. of  5 variables:
##  $ Customer ID                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                          : chr  "Male" "Male" "Female" "Female" ...
##  $ Age                             : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Annual Income (USD in thousands): int  15 15 16 16 17 17 18 18 19 19 ...
##  $ Spending Score (1-100)          : int  39 81 6 77 40 76 6 94 3 72 ...

Let’s define Gender as a factor with two levels, “Male” and “Female”. This tells R that Gender is categorical, which is useful for summary statistics and plotting.

customers$Gender <- factor(customers$Gender, levels = c("Male", "Female"))
str(customers)

## 'data.frame':    200 obs. of  5 variables:
##  $ Customer ID                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                          : Factor w/ 2 levels "Male","Female": 1 1 2 2 2 2 2 2 1 2 ...
##  $ Age                             : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Annual Income (USD in thousands): int  15 15 16 16 17 17 18 18 19 19 ...
##  $ Spending Score (1-100)          : int  39 81 6 77 40 76 6 94 3 72 ...
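The effect of the `levels` argument can be seen on a small example. This is a sketch with made-up values (`gender_chr` is hypothetical), showing why “Male” is stored as code 1 in the output above:

```r
gender_chr <- c("Male", "Male", "Female", "Female", "Female")  # toy values

# Without `levels`, factor() sorts the levels alphabetically
levels(factor(gender_chr))

# With explicit `levels`, "Male" becomes level 1 and "Female" level 2
gender_fct <- factor(gender_chr, levels = c("Male", "Female"))
levels(gender_fct)
as.integer(gender_fct)  # underlying integer codes
table(gender_fct)       # counts are reported in level order
```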

Checking and removing missing data

sum(is.na(customers))

## [1] 0

We see that there are no NAs in our dataset, but for the sake of practice let’s clean the data anyway.

cat("Number of rows before removing NAs:", nrow(customers), "\n")

## Number of rows before removing NAs: 200

customers <- na.omit(customers)
cat("Number of rows after removing NAs:", nrow(customers), "\n")

## Number of rows after removing NAs: 200
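Since our data has no missing values, the row count does not change. To illustrate what the same steps do when NAs are present, here is a sketch on a hypothetical data frame (`toy` and its values are made up):

```r
# Hypothetical data with two missing cells in two different rows
toy <- data.frame(
  Age   = c(19, NA, 20, 23),
  Score = c(39, 81, NA, 77)
)

sum(is.na(toy))          # total number of missing cells
colSums(is.na(toy))      # missing cells per column
clean <- na.omit(toy)    # drops every row with at least one NA
nrow(toy) - nrow(clean)  # number of rows removed
```

Checking `colSums(is.na(...))` first shows which variables are affected before any rows are dropped.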

Descriptive statistics

We load the psych package to use its describe() function, which returns statistics such as the mean, SD, median, minimum, and maximum.

library(psych)
describe(customers[ , -c(1,2)])

The average age of customers is 38.9 years old.
Mean annual income is 60.6 thousand USD.
The skew values indicate that the distributions of both age and annual income are skewed to the right.
The median income is 61.5 thousand USD, meaning that half of the customers earn at most 61.5k USD and the other half earn at least that much.
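To make the idea of right skew concrete, the sketch below computes a simple moment-based skewness on made-up values. The helper `skew_moment()` is hypothetical and uses the divisor n throughout; `describe()` may use a slightly different estimator, so its values need not match this formula exactly.

```r
# Moment-based sample skewness: mean cubed deviation divided by the
# cubed standard deviation (both with divisor n; an assumption of this sketch)
skew_moment <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 10)  # toy values with a long right tail
skew_moment(right_skewed)                   # positive for a right-skewed sample
mean(right_skewed) > median(right_skewed)   # here the tail pulls the mean above the median
```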

library(pastecs)
round(stat.desc(customers[ , -c(1,2)]), 2)

Let’s use stat.desc() from pastecs, which also offers a wide range of descriptive stats.

summary(customers[ , -1])

##     Gender         Age        Annual Income (USD in thousands)
##  Male  : 88   Min.   :18.00   Min.   : 15.00                  
##  Female:112   1st Qu.:28.75   1st Qu.: 41.50                  
##               Median :36.00   Median : 61.50                  
##               Mean   :38.85   Mean   : 60.56                  
##               3rd Qu.:49.00   3rd Qu.: 78.00                  
##               Max.   :70.00   Max.   :137.00                  
##  Spending Score (1-100)
##  Min.   : 1.00         
##  1st Qu.:34.75         
##  Median :50.00         
##  Mean   :50.20         
##  3rd Qu.:73.00         
##  Max.   :99.00

We could also use the base R function summary(), which gives the minimum, 1st quartile, median, mean, 3rd quartile, and maximum for each numeric variable. We keep Gender to see how many rows fall into each factor level (88 males and 112 females).

tapply(customers$`Annual Income (USD in thousands)`, customers$Gender, summary)

## $Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   45.50   62.50   62.23   78.00  137.00 
## 
## $Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   39.75   60.00   59.25   77.25  126.00

Here, we compute summary statistics for annual income grouped by Gender.

This allows us to compare incomes between males and females quickly: in this sample, male customers have a slightly higher mean annual income than female customers (62.23 vs 59.25 thousand USD).
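If only one statistic per group is needed, the same grouping logic gives a compact table. A sketch on hypothetical values (`toy` and the column name `Income` are assumptions for illustration):

```r
# Hypothetical data: two customers per gender
toy <- data.frame(
  Gender = factor(c("Male", "Male", "Female", "Female"),
                  levels = c("Male", "Female")),
  Income = c(15, 19, 16, 20)
)

# One row per group with the group mean
aggregate(Income ~ Gender, data = toy, FUN = mean)

# Difference in group means (Male minus Female)
means <- tapply(toy$Income, toy$Gender, mean)
unname(means["Male"] - means["Female"])
```

A difference computed this way is purely descriptive; whether it is statistically meaningful requires a test.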

Data manipulation II

Let’s create a new categorical variable called IncomeCategory based on annual income.

library(dplyr)
customers <- customers %>%
  mutate(IncomeCategory = ifelse(`Annual Income (USD in thousands)` < 30, "Low", 
                          ifelse(`Annual Income (USD in thousands)` <= 90, "Medium", "High")))

customers$IncomeCategory <- as.factor(customers$IncomeCategory)

table(customers$IncomeCategory)

## 
##   High    Low Medium 
##     22     30    148

prop.table(table(customers$IncomeCategory))

## 
##   High    Low Medium 
##   0.11   0.15   0.74

Here, we categorize annual income as “Low” (below $30k), “Medium” ($30k to $90k inclusive), and “High” (above $90k).

We can see that most of the customers (74%) fall into the “Medium” income category.
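The tables above list the categories alphabetically (High, Low, Medium) because as.factor() sorts the levels. A sketch of one way to get the natural order instead, using made-up incomes (`income` is hypothetical) and the same cut-offs:

```r
income <- c(15, 29, 30, 61, 90, 91, 137)  # toy incomes in thousands of USD

# Same rule as above: < 30 is Low, 30 to 90 inclusive is Medium, > 90 is High
income_cat <- ifelse(income < 30, "Low",
              ifelse(income <= 90, "Medium", "High"))

# Explicit, ordered levels so that tables and plots run from Low to High
income_cat <- factor(income_cat,
                     levels = c("Low", "Medium", "High"),
                     ordered = TRUE)

table(income_cat)
```

The boundary values 30 and 90 both land in “Medium”, which matches the nested ifelse() used for the real data.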

Hypothesis testing

Research question

The management of the city central mall has introduced a new metric, “Spending Score” (ranging from 1 to 100), that summarizes a customer’s purchasing frequency, average basket size, and similar behavior. After collecting data on 200 shoppers, management wants to know whether marketing resources should be allocated differently to campaigns targeting male and female shoppers.

Thus, the key research question is:

“Do male and female customers at the city central mall differ in their average spending score?”

Formulating the hypothesis

To address the research question, we set up a formal hypothesis test comparing two population means (independent samples):

Null Hypothesis ($H_0$):
$\mu_\text{Male} = \mu_\text{Female}$
In other words, there is no difference in the mean Spending Score between male and female customers.
Alternative Hypothesis ($H_1$):
$\mu_\text{Male} \ne \mu_\text{Female}$
There IS a difference in the mean Spending Score between male and female customers.

This is a two-sided test.

Checking test assumptions

To use a two-sample independent t-test, we generally assume:

1. The variable is numeric.
2. The variable is normally distributed in both populations.
3. The samples are independent.
4. The variable has the same variance in both populations; if not, we apply the Welch correction.

Assumptions 1 and 3 hold in our scenario. Let’s check the other two.
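For reference, the equal-variance assumption decides which form of the test statistic is used. The notation below ($\bar{x}$ for sample means, $s$ for sample standard deviations, $n$ for group sizes) is introduced here for illustration. With equal variances, the pooled form is

$$
t = \frac{\bar{x}_\text{Male} - \bar{x}_\text{Female}}{s_p \sqrt{\frac{1}{n_\text{Male}} + \frac{1}{n_\text{Female}}}},
\qquad
s_p^2 = \frac{(n_\text{Male} - 1)\, s_\text{Male}^2 + (n_\text{Female} - 1)\, s_\text{Female}^2}{n_\text{Male} + n_\text{Female} - 2}
$$

with $n_\text{Male} + n_\text{Female} - 2$ degrees of freedom. Without that assumption, the Welch form replaces the pooled standard error:

$$
t = \frac{\bar{x}_\text{Male} - \bar{x}_\text{Female}}{\sqrt{\frac{s_\text{Male}^2}{n_\text{Male}} + \frac{s_\text{Female}^2}{n_\text{Female}}}}
$$

and the degrees of freedom are approximated from the data (Welch-Satterthwaite) instead of being fixed in advance.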

Normality check

As discussed in the lectures, the Shapiro-Wilk test is not recommended for large samples, because it becomes extremely sensitive to small deviations from normality. However, for the sake of practice, let’s conduct one.

by(customers$`Spending Score (1-100)`, customers$Gender, shapiro.test)

## customers$Gender: Male
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.95218, p-value = 0.002627
## 
## ------------------------------------------------------------ 
## customers$Gender: Female
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.97438, p-value = 0.02977

The null hypothesis is that the sample comes from a normally distributed population. The alternative hypothesis is that it DOES NOT.

Here, both p-values are below 0.05, so the test formally rejects normality in both groups. However, as noted above, with samples of this size (88 and 112 observations) the test flags even minor deviations, so we do not rely on this result alone.

Let’s visualize the distributions instead.

library(ggplot2)
library(ggpubr)

hist_male <- ggplot(
  data = subset(customers, Gender == "Male"),
  aes(x = `Spending Score (1-100)`)
) +
  geom_histogram(
    bins = 5, 
    alpha = 0.7, 
    color = "black", 
    fill = "blue"
  ) +
  theme_minimal() +
  labs(
    title = "Male",
    x     = "Spending Score (1-100)",
    y     = "Frequency"
  )

hist_female <- ggplot(
  data = subset(customers, Gender == "Female"),
  aes(x = `Spending Score (1-100)`)
) +
  geom_histogram(
    bins = 5, 
    alpha = 0.7, 
    color = "black", 
    fill = "red"
  ) +
  theme_minimal() +
  labs(
    title = "Female",
    x     = "Spending Score (1-100)",
    y     = "Frequency"
  )

ggarrange(hist_male, hist_female, ncol = 2, nrow = 1)

library(ggpubr)
library(ggplot2)

ggqqplot(
  data = customers,
  x = "`Spending Score (1-100)`",
  color = "Gender",
  fill  = "Gender",
  facet.by = "Gender",
  legend = "none",                 
  ggtheme = theme_minimal()
) +
  scale_color_manual(values = c("Male" = "blue", "Female" = "red")) +
  scale_fill_manual(values  = c("Male" = "blue", "Female" = "red")) +
  labs(
    title = "Q-Q plots for Spending Score by Gender",
    x     = "Theoretical",
    y     = "Sample"
  )

library(psych)
describeBy(customers$`Spending Score (1-100)`, customers$Gender)

## 
##  Descriptive statistics by group 
## group: Male
##    vars  n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 88 48.51 27.9     50   48.46 34.1   1  97    96 -0.06    -1.04 2.97
## ------------------------------------------------------------ 
## group: Female
##    vars   n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 112 51.53 24.11     50   51.61 27.43   5  99    94 0.03    -0.81 2.28

Based on the visualizations, especially the Q–Q plots, we conclude that the variable is approximately normally distributed in both groups. The skewness is also close to zero in both (−0.06 for males and 0.03 for females).

Variance homogeneity check

library(car)
leveneTest(`Spending Score (1-100)` ~ Gender, data = customers)

In Levene’s test for homogeneity of variance, the null hypothesis is that the variances of Spending Score are equal for male and female customers. The alternative hypothesis is that the variances are NOT equal.

The p-value is > 0.05, so we fail to reject the null hypothesis. We can assume equal variances.
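To show what the test is doing, the sketch below reproduces the idea in base R on made-up data: with median centring (the default in car::leveneTest()), Levene’s test amounts to a one-way ANOVA on the absolute deviations from each group’s median. The values and the names `score`, `group`, and `abs_dev` are hypothetical.

```r
# Toy data: two groups with visibly different spread (hypothetical values)
score <- c(48, 50, 52, 49, 51,   10, 90, 30, 70, 50)
group <- factor(rep(c("A", "B"), each = 5))

# Absolute deviation of each value from its own group median
abs_dev <- abs(score - ave(score, group, FUN = median))

# One-way ANOVA on the absolute deviations; the F test for `group`
# is the median-centred Levene (Brown-Forsythe) statistic
levene_fit <- anova(lm(abs_dev ~ group))
levene_fit
```

In this toy example the spreads differ strongly, so the test rejects equal variances, which is the opposite of what we observe for Spending Score.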

Two-sample t-Test

t.test(
  `Spending Score (1-100)` ~ Gender, 
  data = customers,
  var.equal = TRUE,   
  alternative = "two.sided"
)

## 
##  Two Sample t-test
## 
## data:  Spending Score (1-100) by Gender
## t = -0.81905, df = 198, p-value = 0.4137
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -10.275652   4.244808
## sample estimates:
##   mean in group Male mean in group Female 
##             48.51136             51.52679
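As a check, the test statistic and confidence interval can be approximately rebuilt from the group summaries reported above. A sketch using the pooled-variance formula; because the standard deviations are taken from describeBy() rounded to two decimals, the results agree with t.test() only up to rounding:

```r
# Group summaries as reported by describeBy() and t.test() above
n_m <- 88;  mean_m <- 48.51136; sd_m <- 27.90
n_f <- 112; mean_f <- 51.52679; sd_f <- 24.11

# Pooled standard deviation (equal-variance assumption)
dof  <- n_m + n_f - 2
sd_p <- sqrt(((n_m - 1) * sd_m^2 + (n_f - 1) * sd_f^2) / dof)

# Standard error of the difference, t statistic, two-sided p-value
se     <- sd_p * sqrt(1 / n_m + 1 / n_f)
t_stat <- (mean_m - mean_f) / se
p_val  <- 2 * pt(-abs(t_stat), dof)

# 95% confidence interval for the difference in means (Male minus Female)
ci <- (mean_m - mean_f) + c(-1, 1) * qt(0.975, dof) * se

round(c(t = t_stat, p = p_val), 3)
round(ci, 2)
```

The interval contains 0, which is another way of seeing that the difference is not significant at the 5% level.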

library(effectsize)
effectsize::cohens_d(customers$`Spending Score (1-100)` ~ customers$Gender, 
                     pooled_sd = FALSE)

interpret_cohens_d(0.12, rules = "sawilowsky2009")

## [1] "very small"
## (Rules: sawilowsky2009)

The p-value (0.4137) is far greater than 0.05, therefore we fail to reject $H_0$: the data provide no evidence of a difference in mean Spending Score between male and female customers.
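The effect size of about 0.12 can also be approximated by hand from the group summaries. The sketch below computes Cohen’s d with a pooled standard deviation and with an unpooled one, taken here as the square root of the mean of the two variances. That this matches what `pooled_sd = FALSE` does internally is an assumption of the sketch.

```r
# Group summaries as reported by describeBy() and t.test() above
n_m <- 88;  mean_m <- 48.51136; sd_m <- 27.90
n_f <- 112; mean_f <- 51.52679; sd_f <- 24.11

# Pooled SD: consistent with the equal-variance t-test
sd_pooled <- sqrt(((n_m - 1) * sd_m^2 + (n_f - 1) * sd_f^2) / (n_m + n_f - 2))
d_pooled  <- (mean_m - mean_f) / sd_pooled

# Unpooled SD: square root of the mean of the two variances
sd_unpooled <- sqrt((sd_m^2 + sd_f^2) / 2)
d_unpooled  <- (mean_m - mean_f) / sd_unpooled

round(c(pooled = d_pooled, unpooled = d_unpooled), 2)
```

Both variants give roughly the same magnitude here, so the choice of denominator does not change the “very small” interpretation. The negative sign only reflects that the male mean is the lower one.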

Wilcoxon Rank Sum Test

For the sake of practice, let’s assume that normality is violated and perform the nonparametric alternative, the Wilcoxon Rank Sum Test.

In this case, the hypotheses formulation would be a bit different:

Null Hypothesis ($H_0$):
The distribution location of Spending Score is the same for Male and Female.
Alternative Hypothesis ($H_1$):
The distribution location of Spending Score is NOT the same for Male and Female.

wilcox.test(customers$`Spending Score (1-100)` ~ customers$Gender,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  customers$`Spending Score (1-100)` by customers$Gender
## W = 4697.5, p-value = 0.5704
## alternative hypothesis: true location shift is not equal to 0

library(effectsize)
effectsize(wilcox.test(customers$`Spending Score (1-100)` ~ customers$Gender,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))

interpret_rank_biserial(0.05)

## [1] "very small"
## (Rules: funder2019)

The p-value (0.5704) is greater than 0.05, therefore we fail to reject the null hypothesis. The estimated difference in distribution locations is also very small.
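The rank-biserial effect size can be related directly to the W statistic reported above. A sketch, assuming the usual convention that W counts the (male, female) pairs in which the male score is higher, with ties counted as one half:

```r
W   <- 4697.5  # statistic reported by wilcox.test() above
n_m <- 88      # males (first factor level)
n_f <- 112     # females

# Share of all male-female pairs in which the male score is higher
prop_male_higher <- W / (n_m * n_f)

# Rank-biserial correlation: share "male higher" minus share "female higher"
r_rb <- 2 * prop_male_higher - 1

round(c(prop_male_higher = prop_male_higher, r_rb = r_rb), 3)
```

In slightly under half of all pairs the male customer has the higher score, which corresponds to a rank-biserial correlation of about 0.05 in absolute value, the figure interpreted above.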

Conclusion

Given our results, the parametric two-sample t-test is appropriate in this scenario: the sample is reasonably large (n = 200) and none of the assumptions was clearly violated. Spending Score is a numeric variable, its distribution is approximately normal in both groups, the samples are independent, and Levene’s test gave no reason to doubt equal variances for male and female customers.

The t-test yielded a p-value well above 0.05 (0.4137), so we fail to reject the null hypothesis ($\mu_\text{Male} = \mu_\text{Female}$). Based on our sample, we find no statistically significant difference in average Spending Score between male and female customers. This does not prove that the two means are equal, only that the data give no evidence of a difference. The effect size (Cohen’s d) was also very small (about 0.12 in absolute value).

From a business standpoint, there appears to be no justification for changing marketing resource allocation strictly on the basis of customer gender.