mydata <- read.table("./loan.csv",
header = TRUE,
sep = ",",
quote = "\"") # Only the double quote is treated as a quoting character, so apostrophes (e.g. in Bachelor's) are read as plain text
head(mydata)
## age gender occupation education_level marital_status income credit_score loan_status
## 1 32 Male Engineer Bachelor's Married 85000 720 Approved
## 2 45 Female Teacher Master's Single 62000 680 Approved
## 3 28 Male Student High School Single 25000 590 Denied
## 4 51 Female Manager Bachelor's Married 105000 780 Approved
## 5 36 Male Accountant Bachelor's Married 75000 710 Approved
## 6 24 Female Nurse Associate's Single 48000 640 Denied
Unit of the observation: an individual loan applicant
Sample size: 61 observations
Description of the variables:
age
gender
occupation
education_level
marital_status
income
credit_score
loan_status
Source: Mandala, S. K. (n.d.). Simple Loan Classification Dataset. Kaggle. Retrieved 24 March 2025, from https://www.kaggle.com/datasets/sujithmandala/simple-loan-classification-dataset
# install.packages("dplyr")
# install.packages("tidyr")
library(dplyr)
library(tidyr)
mydata <- mydata %>%
rename(Income_USD = income, Age = age, Gender = gender, Occupation = occupation, Education_level = education_level, Marital_status = marital_status, Credit_score = credit_score, Loan_status = loan_status) %>%
drop_na()
# Adding a new variable called "ID"
mydata$ID <- seq_len(nrow(mydata)) # One ID per row, without hard-coding the sample size
mydata <- mydata[ , c(9, 1:8)] # Moves column 9 (ID) to the front
# Removing columns that are not important for the research question
mydata <- mydata[ , -c(2, 4:6, 8:9)] # Keeps ID, Gender and Income_USD
head(mydata)
## ID Gender Income_USD
## 1 1 Male 85000
## 2 2 Female 62000
## 3 3 Male 25000
## 4 4 Female 105000
## 5 5 Male 75000
## 6 6 Female 48000
library(psych)
describeBy(mydata$Income_USD, group = mydata$Gender) # ID is left out because descriptive statistics are not meaningful for an identifier
##
## Descriptive statistics by group
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 30 77233.33 40331.97 66500 71958.33 34841.1 28000 180000 152000 0.97 0.19 7363.58
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 31 80677.42 26507.09 85000 81640 22239 25000 130000 105000 -0.34 -0.68 4760.81
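Income comparison between male and female based on the given data shows that:
Males have a slightly higher mean annual income (80,677.42 USD) than females (77,233.33 USD) and a higher median (85,000 USD versus 66,500 USD)
Incomes of females are more dispersed (SD = 40,331.97 USD, range from 28,000 to 180,000 USD) than incomes of males (SD = 26,507.09 USD, range from 25,000 to 130,000 USD)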
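Required assumptions:
The variable (annual income) is numeric
The two groups are independent of each other
The variable is normally distributed in both populations
The variable has the same variance in both populations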
# install.packages("ggplot2")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = Income_USD)) +
geom_histogram(binwidth = 15000, colour="gray") +
facet_wrap(~Gender, ncol = 1) +
ylab("Frequency") +
xlab("Annual Income in USD")
Histograms interpretation:
# install.packages("ggpubr")
library(ggpubr)
ggqqplot(mydata,
"Income_USD",
facet.by = "Gender")
Q-Q Plot Interpretation:
# install.packages("rstatix")
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
library(dplyr)
mydata %>%
group_by(Gender) %>%
shapiro_test(Income_USD)
## # A tibble: 2 × 4
## Gender variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 Female Income_USD 0.903 0.00969
## 2 Male Income_USD 0.976 0.707
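Hypotheses for Shapiro-Wilk Test:
H0: Annual income is normally distributed
H1: Annual income is not normally distributed
Conclusions:
Female: H0 is rejected (p = 0.010 < 0.05), so annual income is not normally distributed
Male: H0 cannot be rejected (p = 0.707 > 0.05), so normality can be assumed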
# install.packages("car")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydata$Income_USD, group = mydata$Gender)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 3.0923 0.08385 .
## 59
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
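Hypotheses for Levene's Test:
H0: The variance of annual income is the same for females and males
H1: The variance of annual income differs between females and males
Conclusions:
H0 cannot be rejected (p = 0.084 > 0.05), so homogeneity of variance can be assumed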
t.test(mydata$Income_USD ~ mydata$Gender,
var.equal = TRUE,
alternative = "two.sided")
##
## Two Sample t-test
##
## data: mydata$Income_USD by mydata$Gender
## t = -0.39538, df = 59, p-value = 0.694
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -20874.26 13986.09
## sample estimates:
## mean in group Female mean in group Male
## 77233.33 80677.42
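Hypotheses for Independent Samples T-Test:
H0: The mean annual income of females equals the mean annual income of males
H1: The mean annual income of females differs from the mean annual income of males
Conclusion:
H0 cannot be rejected (t = -0.395, df = 59, p = 0.694)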
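As a sanity check, the t statistic printed above can be reproduced from the group summaries reported by describeBy(). This is only a sketch: the means, standard deviations and group sizes are copied from the rounded output, so the last digits may differ slightly, and the variable names are illustrative rather than part of the original analysis.

```r
# Group summaries copied from the describeBy() output (rounded values)
n_f <- 30; mean_f <- 77233.33; sd_f <- 40331.97
n_m <- 31; mean_m <- 80677.42; sd_m <- 26507.09

# Pooled variance, which is what t.test(..., var.equal = TRUE) relies on
dof <- n_f + n_m - 2
sp2 <- ((n_f - 1) * sd_f^2 + (n_m - 1) * sd_m^2) / dof

# Standard error of the difference in means and the t statistic
se_diff <- sqrt(sp2 * (1 / n_f + 1 / n_m))
t_stat <- (mean_f - mean_m) / se_diff

# Two-sided p-value from the t distribution with dof degrees of freedom
p_value <- 2 * pt(-abs(t_stat), dof)

print(round(c(t = t_stat, df = dof, p = p_value), 3))
```

The result should agree with the t.test() output up to rounding of the inputs.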
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(mydata$Income_USD ~ mydata$Gender,
pooled_sd = FALSE)
## Cohen's d | 95% CI
## -------------------------
## -0.10 | [-0.60, 0.40]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.10, rules = "sawilowsky2009")
## [1] "very small"
## (Rules: sawilowsky2009)
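To make the reported effect size easier to follow, Cohen's d can be recomputed by hand from the group summaries. The sketch below assumes that the un-pooled version divides the difference in means by the square root of the average of the two group variances; the inputs are the rounded values from describeBy(), and the variable names are illustrative.

```r
# Group summaries copied from the describeBy() output (rounded values)
mean_f <- 77233.33; sd_f <- 40331.97
mean_m <- 80677.42; sd_m <- 26507.09

# Un-pooled standardizer: square root of the average of the two variances
# (assumed to correspond to pooled_sd = FALSE)
s_unpooled <- sqrt((sd_f^2 + sd_m^2) / 2)

# Cohen's d: standardized difference in means (Female minus Male)
d <- (mean_f - mean_m) / s_unpooled

print(round(d, 2))
```

The sign only reflects the order of the groups (Female minus Male); the interpretation rules are applied to the absolute value.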
Conclusion for the parametric test:
Based on the sample data, we found that the average annual income of men and women does not differ significantly (p = 0.694), and the difference in average income is very small (d = 0.10)
However, this result may be unreliable because the normality assumption is violated for females
Therefore, the corresponding non-parametric test (Wilcoxon Rank Sum Test) should be performed
wilcox.test(mydata$Income_USD ~ mydata$Gender,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$Income_USD by mydata$Gender
## W = 392.5, p-value = 0.2954
## alternative hypothesis: true location shift is not equal to 0
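Hypotheses for Wilcoxon Rank Sum Test:
H0: The distribution locations of annual income are the same for females and males
H1: The distribution locations of annual income differ between females and males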
Conclusion:
The null hypothesis cannot be rejected (p = 0.295 > 0.05)
This suggests that the distribution of annual income does not differ significantly between males and females
To quantify the size of the difference, the effect size can be calculated
# install.packages("effectsize")
library(effectsize)
effectsize(wilcox.test(mydata$Income_USD ~ mydata$Gender,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## -0.16 | [-0.42, 0.13]
interpret_rank_biserial(0.16)
## [1] "small"
## (Rules: funder2019)
The effect size is small (r = 0.16)
The distribution locations of annual income differ only slightly between males and females
Based on the sample data, we found that men and women do not differ significantly in annual income (p = 0.295); the distribution locations differ only slightly between the two populations (the effect size is small, r = 0.16)
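The rank-biserial correlation can also be recovered directly from the W statistic of the Wilcoxon test. The sketch below assumes the usual relationship r = 2W / (n1 * n2) - 1, with W taken from the wilcox.test() output and the group sizes from describeBy(); the variable names are illustrative.

```r
# Values copied from the outputs above
W   <- 392.5  # Wilcoxon rank sum statistic (first group: Female)
n_f <- 30     # number of females
n_m <- 31     # number of males

# Rank-biserial correlation: share of favourable pairs minus share of unfavourable pairs
r_rb <- 2 * W / (n_f * n_m) - 1

print(round(r_rb, 2))
```

A value close to zero means that a randomly chosen woman earns more than a randomly chosen man about as often as the reverse.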
Both the t-test (parametric) and the Wilcoxon test (non-parametric) led to the same conclusion
However, because the normality assumption is violated, the non-parametric test is more reliable and appropriate in this context