ADA HW 2

Homework Assignment 2

Research Question: Do male and female customers have different average credit limits?

First, the dataset has to be imported into RStudio. As it is a CSV file, the read.csv function can be used. I also limit the dataset to 3 variables and the first 500 observations:

Bdata <- read.csv("BankChurners.csv")[1:500, c('CLIENTNUM', 'Gender', 'Credit_Limit')]

Then, it is possible to check what the data looks like:

head(Bdata)

##   CLIENTNUM Gender Credit_Limit
## 1 768805383      M        12691
## 2 818770008      F         8256
## 3 713982108      M         3418
## 4 769911858      F         3313
## 5 709106358      M         4716
## 6 713061558      M         4010

str(Bdata)

## 'data.frame':    500 obs. of  3 variables:
##  $ CLIENTNUM   : int  768805383 818770008 713982108 769911858 709106358 713061558 810347208 818906208 710930508 719661558 ...
##  $ Gender      : chr  "M" "F" "M" "F" ...
##  $ Credit_Limit: num  12691 8256 3418 3313 4716 ...

The dataset covers 500 customers of a bank and includes their individual credit limits. Each row represents one credit card user, which is the unit of observation. With 500 observations (customers), the sample size is 500.

Variables Description

CLIENTNUM (Client Number)
- Type: Categorical Nominal (no meaningful order or magnitude, used to just “label” the clients)
- Definition: Unique identifier for the customer holding the account.
Gender
- Type: Categorical Nominal (no intrinsic order — one isn’t higher or lower than the other)
- Categories: "M", "F"
- Definition: The gender of the customer (male or female).
Credit_Limit (Client’s Credit Limit)
- Type: Numerical Ratio (meaningful zero point: a limit of 0 means no credit; ratio comparisons are meaningful, e.g., one client can have double the limit of another)
- Unit of measurement: USD
- Definition: Credit limit on the customer’s credit card.

The dataset used in this analysis was sourced from Kaggle, an online platform for dataset sharing: https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers

Now, rows containing missing values are removed with na.omit:

Bdata <- na.omit(Bdata)
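
Since na.omit drops rows without reporting how many, it can help to count the missing values first. The following is a minimal base R sketch on a small made-up data frame (toy and its values are hypothetical stand-ins, not the BankChurners data):

```r
# Hypothetical toy data standing in for Bdata (all values are made up)
toy <- data.frame(
  CLIENTNUM    = c(1, 2, 3, 4),
  Gender       = c("M", "F", NA, "F"),
  Credit_Limit = c(12691, NA, 3418, 3313)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Number of rows that na.omit() would drop
rows_dropped <- nrow(toy) - nrow(na.omit(toy))
print(rows_dropped)
```

Applied to Bdata, the same two steps would report how many rows na.omit removes. The describe output later in this report still shows n = 500, so no rows were lost here.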

I also change the name of the first column:

colnames(Bdata)[1] <- "Client_Number"
head(Bdata)

##   Client_Number Gender Credit_Limit
## 1     768805383      M        12691
## 2     818770008      F         8256
## 3     713982108      M         3418
## 4     769911858      F         3313
## 5     709106358      M         4716
## 6     713061558      M         4010

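It is useful to add a categorical variable for the clients’ credit limit, with limits below 10000 USD coded as "Low", limits from 10000 to below 25000 USD as "Medium", and limits of 25000 USD or more as "High":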

Bdata <- cbind(Bdata, rep(NA, nrow(Bdata)))
colnames(Bdata)[4] <- "Credit_Limit_Level"
for (i in 1:nrow(Bdata)) {
  if (Bdata[i,3] < 10000) {Bdata[i,4] <- "Low"} 
  else {if (Bdata[i,3] < 25000) {Bdata[i,4] <- "Medium"} 
    else {Bdata[i,4] <- "High"}} 
}
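
The same three-level coding can be sketched without a loop using cut from base R. The example below runs on made-up credit limits (hypothetical values chosen to sit on both cut-offs); right = FALSE is an assumption made so that the intervals match the strict < comparisons in the loop.

```r
# Hypothetical credit limits (made up) that cover all three levels and both cut-offs
credit_limit <- c(1438.3, 9999.99, 10000, 24999, 25000, 34516)

# right = FALSE closes intervals on the left:
# [-Inf, 10000) -> Low, [10000, 25000) -> Medium, [25000, Inf) -> High
level <- cut(credit_limit,
             breaks = c(-Inf, 10000, 25000, Inf),
             labels = c("Low", "Medium", "High"),
             right  = FALSE)

print(data.frame(credit_limit, level))
```

cut already returns a factor with the levels in the order Low, Medium, High, so a later call to factor with the same levels leaves it unchanged.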

Now I convert the Gender and Credit_Limit_Level variables into factors:

Bdata$Gender <- factor(Bdata$Gender, 
                         levels = c("M", "F"))
Bdata$Credit_Limit_Level <- factor(Bdata$Credit_Limit_Level, 
                            levels = c("Low", "Medium", "High"))

It is possible to create a new data frame with only male clients who have a high credit limit:

Bdata2 <- Bdata[Bdata$Gender=="M" & Bdata$Credit_Limit_Level=="High",]
head(Bdata2)

##    Client_Number Gender Credit_Limit Credit_Limit_Level
## 7      810347208      M        34516               High
## 8      818906208      M        29081               High
## 17     709967358      M        30367               High
## 41     827111283      M        32426               High
## 46     712661433      M        34516               High
## 62     712030833      M        34516               High

The psych package is loaded to compute descriptive statistics:

library(psych)

describe(Bdata[,"Credit_Limit"])

##    vars   n     mean      sd median trimmed     mad    min   max   range skew
## X1    1 500 10554.72 9835.31 6275.5 8853.45 5768.06 1438.3 34516 33077.7 1.21
##    kurtosis     se
## X1     0.31 439.85

The average (mean) credit limit of the bank’s customers was 10554.72 USD, while the median was considerably lower at 6275.5 USD: half of the clients have a credit limit less than or equal to that value, and the other half have a higher one. A mean well above the median is consistent with the positive skew (1.21) reported in the output.
The standard deviation of the credit limit amount is 9835.31 USD, suggesting that the clients’ credit limits are characterized by high variation. The lowest credit limit in the sample is equal to 1438.3 USD, while the highest credit limit is 34516 USD.
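
Because the research question compares two groups, the same statistics can also be computed separately for each gender. Below is a minimal base R sketch on made-up data (the toy data frame and its values are hypothetical, not results from the bank dataset):

```r
# Hypothetical toy data (made up), with the same factor coding as Bdata$Gender
toy <- data.frame(
  Gender       = factor(c("M", "M", "M", "F", "F", "F"), levels = c("M", "F")),
  Credit_Limit = c(12691, 3418, 4716, 8256, 3313, 2000)
)

# Split the credit limits by gender and summarise each group;
# the result is a matrix with one column per gender
group_summary <- sapply(
  split(toy$Credit_Limit, toy$Gender),
  function(x) c(n = length(x), mean = mean(x), median = median(x), sd = sd(x))
)

print(round(group_summary, 2))
```

The psych package also provides describeBy for grouped summaries, which could be applied to Bdata in the same spirit.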

Research Question: Do male and female customers have different average credit limits?

The data consist of independent samples: the observations belong to two different groups of units (male and female clients).

Assumptions for the parametric t-test have to be checked now:

- The variable is numeric.
- The variable is normally distributed in both populations.
- The data come from two independent populations.
- The variable has the same variance in both populations.

The credit limit variable is numeric, so the first assumption is met.

The ggplot2 package is loaded for plotting:

library(ggplot2)

The histograms of credit limit distribution for male and female clients can be used to assess normality:

Male <- ggplot(Bdata[Bdata$Gender=="M", ], aes(x=Credit_Limit)) + 
  geom_histogram(binwidth=2000, fill="blue4", col="darkgrey") +
  ylab("Frequency") +
  ggtitle("Male Clients")

Female <- ggplot(Bdata[Bdata$Gender=="F", ], aes(x=Credit_Limit)) + 
  geom_histogram(binwidth=2000, fill="deeppink3", col="darkgrey") +
  ylab("Frequency") +
  ggtitle("Female Clients")

The ggpubr package is loaded to arrange the two histograms side by side:

library(ggpubr)

ggarrange(Male, Female, ncol=2, nrow=1)

The histograms already suggest that the normality assumption is strongly violated for both male and female clients. For male customers the distribution is multimodal: most male clients have relatively low credit limits, but there is also a sizeable group with very high limits. The distribution for female customers is clearly right-skewed. Most values are concentrated near the lower end (0–5000), with a sharp decline in frequency as credit limits increase. A few observations exist at the higher end (20000–35000), but they are rare.

It is also possible to perform the Shapiro-Wilk normality test:

library(rstatix)
library(dplyr)

Bdata %>% group_by(Gender) %>% shapiro_test(Credit_Limit)

## # A tibble: 2 × 4
##   Gender variable     statistic        p
##   <fct>  <chr>            <dbl>    <dbl>
## 1 M      Credit_Limit     0.868 9.59e-16
## 2 F      Credit_Limit     0.722 2.28e-17

The Shapiro-Wilk test shows a significant deviation from normality (p < 0.001 for both male and female clients), so the hypothesis of normally distributed credit limits is rejected in both groups.

I also check whether the variances of male and female clients’ credit limits are equal:

library(car)

leveneTest(Bdata$Credit_Limit, Bdata$Gender)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   1  75.917 < 2.2e-16 ***
##       498                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the p-value (p < 0.001), the null hypothesis that the variances are equal can be rejected.
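
For intuition, Levene's test with center = median amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that idea in base R on made-up data (toy and its values are hypothetical); it illustrates the mechanics and is not meant as a replacement for leveneTest.

```r
# Hypothetical toy data (made up): a wide spread for "M", a narrow spread for "F"
toy <- data.frame(
  Gender       = factor(rep(c("M", "F"), each = 5), levels = c("M", "F")),
  Credit_Limit = c(2000, 5000, 12000, 25000, 34516,
                   1500, 2500, 3000, 3500, 5000)
)

# Median credit limit within each gender
group_median <- tapply(toy$Credit_Limit, toy$Gender, median)

# Absolute deviation of every observation from its own group median
toy$abs_dev <- abs(toy$Credit_Limit -
                   as.numeric(group_median[as.character(toy$Gender)]))

# One-way ANOVA on the absolute deviations
levene_table <- anova(lm(abs_dev ~ Gender, data = toy))
print(levene_table)
```

A large F value in this table means the typical distance from the group median differs between the groups, which is what unequal variances look like.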

If the assumptions of normality and equal variances were not violated, I would opt for an independent samples t-test (t.test(Bdata$Credit_Limit ~ Bdata$Gender, var.equal=TRUE, alternative="two.sided")).

However, both assumptions are clearly violated; consequently, a non-parametric Wilcoxon rank-sum test should be used.

H0: The distribution locations of credit limits are the same for male and female clients.

H1: The distribution locations of credit limits are different for male and female clients.

wilcox.test(Bdata$Credit_Limit ~ Bdata$Gender,
            correct=FALSE,
            exact=FALSE,
            alternative="two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  Bdata$Credit_Limit by Bdata$Gender
## W = 42578, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Given the results (p < 0.001), I reject the null hypothesis that the distribution locations of credit limits are the same for men and women.
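
To make the W statistic less of a black box: it is the sum of the ranks of the first group in the pooled sample, minus n1 * (n1 + 1) / 2. The following minimal sketch checks this on made-up samples without ties (male, female, and their values are hypothetical):

```r
# Hypothetical toy samples (made up, no ties)
male   <- c(12691, 3418, 4716, 34516)
female <- c(8256, 3313, 2000)

# Rank all observations together, then sum the ranks of the first group
ranks <- rank(c(male, female))
n1 <- length(male)
W_manual <- sum(ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2

# The same statistic as reported by wilcox.test
W_builtin <- unname(
  wilcox.test(male, female, correct = FALSE, exact = FALSE)$statistic
)

print(c(manual = W_manual, builtin = W_builtin))
```

Equivalently, W counts the (male, female) pairs in which the male value is the larger one, so a W close to n1 * n2 means the first group tends to have higher values.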

Finally, I calculate the effect size:

library(effectsize)

effectsize(wilcox.test(Bdata$Credit_Limit ~ Bdata$Gender,
            correct=FALSE,
            exact=FALSE,
            alternative="two.sided"))

## r (rank biserial) |       95% CI
## --------------------------------
## 0.46              | [0.37, 0.54]

interpret_rank_biserial(0.46)

## [1] "very large"
## (Rules: funder2019)

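The effect size, measured as the rank-biserial correlation (r = 0.46, 95% CI [0.37, 0.54]), indicates a very large difference between the two distributions. With "M" as the first factor level, the positive sign points to male clients tending to have the higher credit limits, which matches the histograms.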
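
For a two-sample Wilcoxon test, the rank-biserial correlation can be derived from the W statistic and the two group sizes as r = 2W / (n1 * n2) - 1. The helper below (rank_biserial_from_W, a hypothetical name) is a minimal sketch of that relationship on made-up samples; the group sizes of the actual sample are not reported above, so the value 0.46 is not recomputed here.

```r
# Hypothetical helper: rank-biserial correlation from W and the group sizes
rank_biserial_from_W <- function(W, n1, n2) {
  2 * W / (n1 * n2) - 1
}

# Hypothetical toy samples (made up, no ties)
male   <- c(12691, 3418, 4716, 34516)
female <- c(8256, 3313, 2000)

W <- unname(
  wilcox.test(male, female, correct = FALSE, exact = FALSE)$statistic
)
r <- rank_biserial_from_W(W, length(male), length(female))
print(r)
```

The value ranges from -1 (every observation in the first group is smaller) to 1 (every observation in the first group is larger), with 0 meaning neither group tends to dominate.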


Anna Kostiukovych

2025-03-29
