Research Question: Do male and female customers have different average credit limits?
First, the dataset has to be imported into RStudio. As it is a CSV file, the read.csv function can be used. I also limit the dataset to 3 variables and the first 500 observations:
Bdata <- read.csv("BankChurners.csv")[1:500, c('CLIENTNUM', 'Gender', 'Credit_Limit')]
Then, it is possible to check what the data looks like:
head(Bdata)
## CLIENTNUM Gender Credit_Limit
## 1 768805383 M 12691
## 2 818770008 F 8256
## 3 713982108 M 3418
## 4 769911858 F 3313
## 5 709106358 M 4716
## 6 713061558 M 4010
str(Bdata)
## 'data.frame': 500 obs. of 3 variables:
## $ CLIENTNUM : int 768805383 818770008 713982108 769911858 709106358 713061558 810347208 818906208 710930508 719661558 ...
## $ Gender : chr "M" "F" "M" "F" ...
## $ Credit_Limit: num 12691 8256 3418 3313 4716 ...
The dataset consists of 500 customers of a bank and includes their individual credit limits. Each row represents an individual credit card user, which is the unit of observation here. The dataset contains 500 observations (customers), so the sample size is 500.
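Variables Description
CLIENTNUM: client number (integer)
Gender: gender of the client, "M", "F"
Credit_Limit: credit limit of the client in USD (numeric)
The dataset used in this analysis was sourced from Kaggle, an online platform for dataset sharing: https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers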
Now, the missing values have to be identified and removed:
Bdata <- na.omit(Bdata)
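Because na.omit() drops incomplete rows without reporting them, it can help to count the missing values first. A minimal sketch on a small made-up data frame (the `toy` object and its values are hypothetical and are not taken from BankChurners.csv):

```r
# Hypothetical toy data with two missing values (not the real dataset)
toy <- data.frame(Gender = c("M", "F", NA, "F"),
                  Credit_Limit = c(12000, NA, 3400, 3300))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Indices of rows that contain at least one missing value
incomplete_rows <- which(!complete.cases(toy))
print(incomplete_rows)

# Remove the incomplete rows, as na.omit() does in the analysis
toy_clean <- na.omit(toy)
print(nrow(toy_clean))
```

With the real data, `Bdata` would take the place of `toy`; comparing nrow() before and after na.omit() shows how many rows were dropped.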
I also change the name of the first column:
colnames(Bdata)[1] <- "Client_Number"
head(Bdata)
## Client_Number Gender Credit_Limit
## 1 768805383 M 12691
## 2 818770008 F 8256
## 3 713982108 M 3418
## 4 769911858 F 3313
## 5 709106358 M 4716
## 6 713061558 M 4010
It would be fitting to add a categorical variable for the clients’ credit limit:
Bdata <- cbind(Bdata, rep(NA, nrow(Bdata)))
colnames(Bdata)[4] <- "Credit_Limit_Level"
for (i in 1:nrow(Bdata)) {
  if (Bdata[i, 3] < 10000) {
    Bdata[i, 4] <- "Low"
  } else if (Bdata[i, 3] < 25000) {
    Bdata[i, 4] <- "Medium"
  } else {
    Bdata[i, 4] <- "High"
  }
}
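The same classification can also be written without a loop. The sketch below uses base R's cut() on a hypothetical vector of credit limits (the `credit_limit` values are made up to include the boundary cases); it is an alternative to the loop, not the method used in this analysis:

```r
# Hypothetical credit limits, including the boundary values 10000 and 25000
credit_limit <- c(3400, 9999.99, 10000, 24999, 25000, 34000)

# right = FALSE makes each interval closed on the left, [a, b),
# so 10000 falls into "Medium" and 25000 into "High", as in the loop
level <- cut(credit_limit,
             breaks = c(-Inf, 10000, 25000, Inf),
             labels = c("Low", "Medium", "High"),
             right = FALSE)

print(data.frame(credit_limit, level))
```

cut() already returns a factor whose levels are ordered Low, Medium, High, so a separate factor() call would not be needed for this variable.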
Now I convert the Gender and Credit_Limit_Level variables into factors:
Bdata$Gender <- factor(Bdata$Gender,
levels = c("M", "F"))
Bdata$Credit_Limit_Level <- factor(Bdata$Credit_Limit_Level,
levels = c("Low", "Medium", "High"))
It is possible to create a new data frame with only male clients who have a high credit limit:
Bdata2 <- Bdata[Bdata$Gender=="M" & Bdata$Credit_Limit_Level=="High",]
head(Bdata2)
## Client_Number Gender Credit_Limit Credit_Limit_Level
## 7 810347208 M 34516 High
## 8 818906208 M 29081 High
## 17 709967358 M 30367 High
## 41 827111283 M 32426 High
## 46 712661433 M 34516 High
## 62 712030833 M 34516 High
The psych library has to be loaded now:
library(psych)
describe(Bdata[,"Credit_Limit"])
## vars n mean sd median trimmed mad min max range skew
## X1 1 500 10554.72 9835.31 6275.5 8853.45 5768.06 1438.3 34516 33077.7 1.21
## kurtosis se
## X1 0.31 439.85
The average (mean) credit limit of the bank’s customers was
10554.72 USD, while the median was considerably lower
at 6275.5 USD, indicating that half of the clients have a
credit limit less than or equal to that value, while the credit limit of
the other half is higher. A mean well above the median is consistent with the positive skew (1.21) reported above.
The standard deviation of the credit limit amount is 9835.31
USD, suggesting that the clients’ credit limits are
characterized by high variation. The lowest credit limit in the sample
is equal to 1438.3 USD, while the highest credit limit
is 34516 USD.
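Since the research question compares two groups, the same statistics can also be computed separately for each gender. A minimal base-R sketch on made-up values (the `toy` data frame is hypothetical; with the real data, `Bdata` would take its place):

```r
# Hypothetical toy data (not the real dataset)
toy <- data.frame(
  Gender = factor(c("M", "M", "M", "F", "F", "F"), levels = c("M", "F")),
  Credit_Limit = c(4000, 12000, 34000, 2000, 3000, 8500)
)

# Mean, median and standard deviation of the credit limit per gender
means   <- tapply(toy$Credit_Limit, toy$Gender, mean)
medians <- tapply(toy$Credit_Limit, toy$Gender, median)
sds     <- tapply(toy$Credit_Limit, toy$Gender, sd)

print(rbind(mean = means, median = medians, sd = sds))
```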
Research Question: Do male and female customers have different average credit limits?
The data consist of independent samples: the observations belong to two different groups of units (male and female clients).
Assumptions for the parametric t-test have to be checked now:
Numeric variable
The distribution of the variable is normal in both populations
The data must come from two independent populations
Variable has the same variance in both populations
The credit limit variable is numeric, so the first assumption is met.
The ggplot2 library has to be loaded now:
library(ggplot2)
The histograms of credit limit distribution for male and female clients can be used to assess normality:
Male <- ggplot(Bdata[Bdata$Gender=="M", ], aes(x=Credit_Limit)) +
geom_histogram(binwidth=2000, fill="blue4", col="darkgrey") +
ylab("Frequency") +
ggtitle("Male Clients")
Female <- ggplot(Bdata[Bdata$Gender=="F", ], aes(x=Credit_Limit)) +
geom_histogram(binwidth=2000, fill="deeppink3", col="darkgrey") +
ylab("Frequency") +
ggtitle("Female Clients")
The ggpubr library has to be loaded now:
library(ggpubr)
ggarrange(Male, Female, ncol=2, nrow=1)
It can already be concluded that the normality assumption is heavily violated for both male and female clients. For male customers the distribution is multimodal. Most male clients have relatively low credit limits, but there is also a sizeable group with very high limits. The distribution for female customers is clearly right-skewed. Most values are concentrated near the lower end (0–5000), with a sharp decline in frequency as credit limits increase. Some values occur at the higher end (20000–35000), but they are rare.
It is also possible to perform the Shapiro-Wilk normality test:
library(rstatix)
library(dplyr)
Bdata %>% group_by(Gender) %>% shapiro_test(Credit_Limit)
## # A tibble: 2 × 4
## Gender variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 M Credit_Limit 0.868 9.59e-16
## 2 F Credit_Limit 0.722 2.28e-17
The Shapiro-Wilk test shows a significant deviation from normality for both male and female clients (p < 0.001 in both groups).
I also check whether the variances of male and female clients’ credit limits are equal:
library(car)
leveneTest(Bdata$Credit_Limit, Bdata$Gender)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 75.917 < 2.2e-16 ***
## 498
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the p-value (p < 0.001), the null hypothesis that the variances are equal can be rejected.
If the assumptions of normality and equal variances were not
violated, I would opt for an independent samples t-test
(t.test(Bdata$Credit_Limit ~ Bdata$Gender, var.equal=TRUE, alternative="two.sided")).
However, both assumptions are clearly violated, so the non-parametric Wilcoxon rank-sum test should be used.
H0: The distribution locations of credit limits are the same for male and female clients.
H1: The distribution locations of credit limits are different for male and female clients.
wilcox.test(Bdata$Credit_Limit ~ Bdata$Gender,
correct=FALSE,
exact=FALSE,
alternative="two.sided")
##
## Wilcoxon rank sum test
##
## data: Bdata$Credit_Limit by Bdata$Gender
## W = 42578, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
Given the results (p < 0.001), I reject the null hypothesis that the distribution locations of credit limits are the same for men and women.
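To make the test statistic less opaque: the W reported by wilcox.test() is the sum of the ranks of the first group (here "M", the first factor level) minus its smallest possible value n1(n1 + 1)/2. This equals the number of (male, female) pairs in which the male value is larger, with ties counted as one half. A sketch on made-up numbers (the `male` and `female` vectors are hypothetical):

```r
# Hypothetical credit limits for two small groups (not the real data)
male   <- c(4000, 12000, 34000, 9000)
female <- c(2000, 3000, 8500)

# Rank all observations together, then sum the ranks of the first group
all_ranks <- rank(c(male, female))
n1 <- length(male)
w_manual <- sum(all_ranks[1:n1]) - n1 * (n1 + 1) / 2

# The same statistic as returned by wilcox.test()
w_builtin <- unname(wilcox.test(male, female,
                                exact = FALSE, correct = FALSE)$statistic)

print(c(manual = w_manual, builtin = w_builtin))
```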
Finally, I calculate the effect size:
library(effectsize)
effectsize(wilcox.test(Bdata$Credit_Limit ~ Bdata$Gender,
correct=FALSE,
exact=FALSE,
alternative="two.sided"))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.46 | [0.37, 0.54]
interpret_rank_biserial(0.46)
## [1] "very large"
## (Rules: funder2019)
The effect size, measured as the rank-biserial correlation (r = 0.46, 95% CI [0.37, 0.54]), indicates a very large difference between the two distributions according to the funder2019 rules.
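As a rough illustration of what this coefficient measures: for two independent groups, the rank-biserial correlation can be written in terms of the W statistic as r = 2W/(n1 n2) - 1, that is, the share of (male, female) pairs in which the male value is larger minus the share in which the female value is larger. A base-R sketch on made-up numbers (the vectors are hypothetical; the group sizes of the real sample are not shown in the output above, so the reported 0.46 is not recomputed here):

```r
# Hypothetical credit limits for two small groups (not the real data)
male   <- c(4000, 12000, 34000, 9000)
female <- c(2000, 3000, 8500)
n1 <- length(male)
n2 <- length(female)

# Rank-biserial correlation derived from the W statistic
w <- unname(wilcox.test(male, female,
                        exact = FALSE, correct = FALSE)$statistic)
r_from_w <- 2 * w / (n1 * n2) - 1

# The same quantity from all pairwise differences (male minus female)
diffs <- outer(male, female, "-")
r_pairs <- (sum(diffs > 0) - sum(diffs < 0)) / (n1 * n2)

print(c(from_w = r_from_w, from_pairs = r_pairs))
```

A positive value means that the first group tends to have the larger values, and a value of 0 means that neither group tends to exceed the other.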