Author: Polina Kamenchuk. Student ID: 12300685
##to upload data set into R.
data <- read.table("./loan_data.csv", ###name of the file
header = TRUE, ###first row as a header
sep = ",", ###columns are separated from each other with ","
dec = ".") ###"." is used in decimal numbers
str(data) ###to see the data structure
## 'data.frame': 9578 obs. of 14 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : chr "debt_consolidation" "credit_card" "debt_consolidation" "debt_consolidation" ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
## $ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...
head(data) ###to display the first 6 rows
## credit.policy purpose int.rate installment log.annual.inc dti
## 1 1 debt_consolidation 0.1189 829.10 11.35041 19.48
## 2 1 credit_card 0.1071 228.22 11.08214 14.29
## 3 1 debt_consolidation 0.1357 366.86 10.37349 11.63
## 4 1 debt_consolidation 0.1008 162.34 11.35041 8.10
## 5 1 credit_card 0.1426 102.92 11.29973 14.97
## 6 1 credit_card 0.0788 125.13 11.90497 16.98
## fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs
## 1 737 5639.958 28854 52.1 0 0
## 2 707 2760.000 33623 76.7 0 0
## 3 682 4710.000 3511 25.6 1 0
## 4 712 2699.958 33667 73.2 1 0
## 5 667 4066.000 4740 39.5 0 1
## 6 727 6120.042 50807 51.0 0 0
## pub.rec not.fully.paid
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
DATA DESCRIPTION
The data set was taken from the Kaggle website (https://www.kaggle.com/datasets/itssuru/loan-data).
The sample contains users (borrowers) of the platform LendingClub.com (a platform where private borrowers and lenders meet). The sample size is 9578 observations (this can be seen from data in the Environment pane).
The unit of observation is one such borrower.
description of each variable:
credit.policy: 1 if the customer meets the credit underwriting criteria, and 0 if not, categorical.
purpose: The purpose of the loan, categorical, nominal.
int.rate: The interest rate of the loan, as a proportion (10% is stored as 0.10), numerical.
installment: The monthly installments owed by the borrower if the loan is funded in USD, numerical.
log.annual.inc: The natural log of the self-reported annual income of the borrower, numerical.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income), numerical.
fico: The FICO credit score of the borrower, in scores, numerical.
days.with.cr.line: The number of days the borrower has had a credit line, number of days, numerical.
revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle, in USD), numerical.
revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available), numerical.
inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months, amount, numerical.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years, numerical.
pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
###I don't really understand the FICO score and I assume there are enough other variables to analyse, so let me remove it from the data set :)
data <- data[, -7 ]
head(data)
## credit.policy purpose int.rate installment log.annual.inc dti
## 1 1 debt_consolidation 0.1189 829.10 11.35041 19.48
## 2 1 credit_card 0.1071 228.22 11.08214 14.29
## 3 1 debt_consolidation 0.1357 366.86 10.37349 11.63
## 4 1 debt_consolidation 0.1008 162.34 11.35041 8.10
## 5 1 credit_card 0.1426 102.92 11.29973 14.97
## 6 1 credit_card 0.0788 125.13 11.90497 16.98
## days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 5639.958 28854 52.1 0 0 0
## 2 2760.000 33623 76.7 0 0 0
## 3 4710.000 3511 25.6 1 0 0
## 4 2699.958 33667 73.2 1 0 0
## 5 4066.000 4740 39.5 0 1 0
## 6 6120.042 50807 51.0 0 0 0
## not.fully.paid
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
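Removing a column by position works, but it breaks silently if the column order ever changes. A minimal sketch of dropping a column by name instead, on a made-up data frame (`toy` and its values are assumptions for illustration, not the loan data):

```r
# Toy data frame standing in for the loan data (values are made up).
toy <- data.frame(int.rate = c(0.12, 0.11),
                  fico     = c(737, 707),
                  dti      = c(19.48, 14.29))

# Keep every column whose name is not "fico".
toy_no_fico <- toy[, setdiff(names(toy), "fico"), drop = FALSE]

names(toy_no_fico)
```

The same pattern should work for several columns at once by passing a vector of names to `setdiff()`.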
calculate_number_of_variables <- function(data) {
num_variables <- ncol(data)
cat("Number of variables in the dataset:", num_variables, "\n")
}
calculate_number_of_variables(data)
## Number of variables in the dataset: 13
There are 13 variables, which can also be seen in Environment > data, but the number can also be calculated with the function above.
data$credit.policy.F <- factor(data$credit.policy,
levels = c(1, 0),
labels = c("Fulfill", "Otherwise")) ###to convert categorical into factor
data$purposeF <- factor(data$purpose,
levels = c("credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", "all_other", "home_improvement"),
labels = c("credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", "all_other", "home_improvement")) ###to convert categorical into factor
data$not.fully.paidF <- factor(data$not.fully.paid,
levels = c(1, 0),
labels = c("Not Paid", "Paid")) ###to convert categorical into factor
data$revol.balN <- as.numeric(data$revol.bal) ###to convert integer to numeric, because the unpaid amount is in dollars and so should be treated as numerical
data <- data[ , c(-1, -2, -8, -13) ] ### I also delete the original variables that were transformed, because we now have new, corrected versions of them
head(data, 15) ### display first 15 rows
## int.rate installment log.annual.inc dti days.with.cr.line revol.util
## 1 0.1189 829.10 11.350407 19.48 5639.958 52.1
## 2 0.1071 228.22 11.082143 14.29 2760.000 76.7
## 3 0.1357 366.86 10.373491 11.63 4710.000 25.6
## 4 0.1008 162.34 11.350407 8.10 2699.958 73.2
## 5 0.1426 102.92 11.299732 14.97 4066.000 39.5
## 6 0.0788 125.13 11.904968 16.98 6120.042 51.0
## 7 0.1496 194.02 10.714418 4.00 3180.042 76.8
## 8 0.1114 131.22 11.002100 11.08 5116.000 68.6
## 9 0.1134 87.19 11.407565 17.25 3989.000 51.1
## 10 0.1221 84.12 10.203592 10.00 2730.042 23.0
## 11 0.1347 360.43 10.434116 22.09 6713.042 71.0
## 12 0.1324 253.58 11.835009 9.16 4298.000 18.2
## 13 0.0859 316.11 10.933107 15.49 6519.958 16.7
## 14 0.0714 92.82 11.512925 6.50 4384.000 4.8
## 15 0.0863 209.54 9.487972 9.73 1559.958 44.6
## inq.last.6mths delinq.2yrs pub.rec credit.policy.F purposeF
## 1 0 0 0 Fulfill debt_consolidation
## 2 0 0 0 Fulfill credit_card
## 3 1 0 0 Fulfill debt_consolidation
## 4 1 0 0 Fulfill debt_consolidation
## 5 0 1 0 Fulfill credit_card
## 6 0 0 0 Fulfill credit_card
## 7 0 0 1 Fulfill debt_consolidation
## 8 0 0 0 Fulfill all_other
## 9 1 0 0 Fulfill home_improvement
## 10 1 0 0 Fulfill debt_consolidation
## 11 2 0 1 Fulfill debt_consolidation
## 12 2 1 0 Fulfill debt_consolidation
## 13 0 0 0 Fulfill debt_consolidation
## 14 0 1 0 Fulfill small_business
## 15 0 0 0 Fulfill debt_consolidation
## not.fully.paidF revol.balN
## 1 Paid 28854
## 2 Paid 33623
## 3 Paid 3511
## 4 Paid 33667
## 5 Paid 4740
## 6 Paid 50807
## 7 Not Paid 3839
## 8 Not Paid 24220
## 9 Paid 69909
## 10 Paid 5630
## 11 Paid 13846
## 12 Paid 5122
## 13 Paid 6068
## 14 Paid 3021
## 15 Paid 6282
str(data) ###structure
## 'data.frame': 9578 obs. of 13 variables:
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
## $ credit.policy.F : Factor w/ 2 levels "Fulfill","Otherwise": 1 1 1 1 1 1 1 1 1 1 ...
## $ purposeF : Factor w/ 7 levels "credit_card",..: 2 1 2 2 1 1 2 6 7 2 ...
## $ not.fully.paidF : Factor w/ 2 levels "Not Paid","Paid": 2 2 2 2 2 2 1 1 2 2 ...
## $ revol.balN : num 28854 33623 3511 33667 4740 ...
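To make the factor conversion easier to follow, here is a small sketch on made-up 0/1 codes (the vector `codes` is an assumption for illustration). It shows how `levels` fixes the order of the codes and `labels` gives the text attached to each code:

```r
# Made-up 0/1 codes, mirroring how credit.policy is coded.
codes <- c(1, 1, 0, 1, 0)

# levels gives the order of the codes, labels gives the text shown for each code.
policy <- factor(codes, levels = c(1, 0), labels = c("Fulfill", "Otherwise"))

levels(policy)      # the first level is the label attached to code 1
table(policy)       # number of observations per category
as.integer(policy)  # internal codes: 1 for the first level, 2 for the second
```

Because code 1 is listed first in `levels`, "Fulfill" becomes the first level, which is why `str()` prints 1 for borrowers who meet the criteria.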
###to round data to 2 decimal places (just to make it look a bit nicer)
data$log.annual.inc <- round(data$log.annual.inc, 2)
data$int.rate <- round(data$int.rate, 2)
data$installment <- round(data$installment, 2)
data$dti <- round(data$dti, 2)
data$days.with.cr.line <- round(data$days.with.cr.line, 2)
head(data)
## int.rate installment log.annual.inc dti days.with.cr.line revol.util
## 1 0.12 829.10 11.35 19.48 5639.96 52.1
## 2 0.11 228.22 11.08 14.29 2760.00 76.7
## 3 0.14 366.86 10.37 11.63 4710.00 25.6
## 4 0.10 162.34 11.35 8.10 2699.96 73.2
## 5 0.14 102.92 11.30 14.97 4066.00 39.5
## 6 0.08 125.13 11.90 16.98 6120.04 51.0
## inq.last.6mths delinq.2yrs pub.rec credit.policy.F purposeF
## 1 0 0 0 Fulfill debt_consolidation
## 2 0 0 0 Fulfill credit_card
## 3 1 0 0 Fulfill debt_consolidation
## 4 1 0 0 Fulfill debt_consolidation
## 5 0 1 0 Fulfill credit_card
## 6 0 0 0 Fulfill credit_card
## not.fully.paidF revol.balN
## 1 Paid 28854
## 2 Paid 33623
## 3 Paid 3511
## 4 Paid 33667
## 5 Paid 4740
## 6 Paid 50807
###to find missing values. if there are any, we can remove them, e.g. with the function "drop_na" from the tidyr package
find_missing_values <- function(data) {
missing <- sum(is.na(data))
if(missing > 0) {
cat("Number of missing values in the dataset:", missing, "\n")
cat("Indices of missing values:\n")
print(which(is.na(data), arr.ind = TRUE))
} else {
cat("No missing values found in the dataset.\n")
}
}
find_missing_values(data)
## No missing values found in the dataset.
###however, as we can see, there are no missing values here
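If a data set did contain missing values, it is often more useful to count them per column than in total. A minimal base R sketch on a made-up data frame (`toy` is an assumption for illustration):

```r
# Made-up data frame with two missing values.
toy <- data.frame(a = c(1, NA, 3),
                  b = c("x", "y", NA),
                  c = c(TRUE, FALSE, TRUE))

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Keep only the rows without any missing value.
complete_rows <- toy[complete.cases(toy), ]

na_per_column
complete_rows
```

`complete.cases()` is the base R counterpart of dropping incomplete rows, so no extra package is needed for this step.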
summary(data) ###descriptive statistic for all variables
## int.rate installment log.annual.inc dti
## Min. :0.0600 Min. : 15.67 Min. : 7.55 Min. : 0.000
## 1st Qu.:0.1000 1st Qu.:163.77 1st Qu.:10.56 1st Qu.: 7.213
## Median :0.1200 Median :268.95 Median :10.93 Median :12.665
## Mean :0.1228 Mean :319.09 Mean :10.93 Mean :12.607
## 3rd Qu.:0.1400 3rd Qu.:432.76 3rd Qu.:11.29 3rd Qu.:17.950
## Max. :0.2200 Max. :940.14 Max. :14.53 Max. :29.960
##
## days.with.cr.line revol.util inq.last.6mths delinq.2yrs
## Min. : 179 Min. : 0.0 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 2820 1st Qu.: 22.6 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 4140 Median : 46.3 Median : 1.000 Median : 0.0000
## Mean : 4561 Mean : 46.8 Mean : 1.577 Mean : 0.1637
## 3rd Qu.: 5730 3rd Qu.: 70.9 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :17640 Max. :119.0 Max. :33.000 Max. :13.0000
##
## pub.rec credit.policy.F purposeF not.fully.paidF
## Min. :0.00000 Fulfill :7710 credit_card :1262 Not Paid:1533
## 1st Qu.:0.00000 Otherwise:1868 debt_consolidation:3957 Paid :8045
## Median :0.00000 educational : 343
## Mean :0.06212 major_purchase : 437
## 3rd Qu.:0.00000 small_business : 619
## Max. :5.00000 all_other :2331
## home_improvement : 629
## revol.balN
## Min. : 0
## 1st Qu.: 3187
## Median : 8596
## Mean : 16914
## 3rd Qu.: 18250
## Max. :1207359
##
summary(data [ , c(-10, -11, -12) ]) ###descriptive statistics without categorical variable
## int.rate installment log.annual.inc dti
## Min. :0.0600 Min. : 15.67 Min. : 7.55 Min. : 0.000
## 1st Qu.:0.1000 1st Qu.:163.77 1st Qu.:10.56 1st Qu.: 7.213
## Median :0.1200 Median :268.95 Median :10.93 Median :12.665
## Mean :0.1228 Mean :319.09 Mean :10.93 Mean :12.607
## 3rd Qu.:0.1400 3rd Qu.:432.76 3rd Qu.:11.29 3rd Qu.:17.950
## Max. :0.2200 Max. :940.14 Max. :14.53 Max. :29.960
## days.with.cr.line revol.util inq.last.6mths delinq.2yrs
## Min. : 179 Min. : 0.0 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 2820 1st Qu.: 22.6 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 4140 Median : 46.3 Median : 1.000 Median : 0.0000
## Mean : 4561 Mean : 46.8 Mean : 1.577 Mean : 0.1637
## 3rd Qu.: 5730 3rd Qu.: 70.9 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :17640 Max. :119.0 Max. :33.000 Max. :13.0000
## pub.rec revol.balN
## Min. :0.00000 Min. : 0
## 1st Qu.:0.00000 1st Qu.: 3187
## Median :0.00000 Median : 8596
## Mean :0.06212 Mean : 16914
## 3rd Qu.:0.00000 3rd Qu.: 18250
## Max. :5.00000 Max. :1207359
I won't go through every variable, as that would add little, so I'll discuss just two variables of my choice.
The lowest interest rate in this sample was 6%, the highest was 22%. 1st Qu means that a quarter of borrowers had interest rates of 10% or lower, the median means that half of borrowers had interest rates of 12% or lower, and 3rd Qu means that 75% of borrowers had interest rates of 14% or lower, so the last quarter had interest rates above 14%. The mean is the arithmetic average of the interest rate, which is 12.28%.
days.with.cr.line is the number of days the borrower has had a credit line. The shortest credit history is 179 days, the longest is 17,640 days. The mean is found by adding all the values and dividing by the number of observations, and it shows that on average borrowers have had a credit line for 4561 days. 1st Qu means that a quarter of borrowers have had a credit line for 2820 days or less. The median means that half of the borrowers have had a credit line for 4,140 days or less. 3rd Qu means that 75% of borrowers have had a credit line for 5730 days or less.
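The "a quarter of borrowers had ... or lower" reading of the quartiles can be checked directly. A minimal sketch on a made-up vector (`x` is an assumption for illustration): it computes the quartiles and then the share of observations at or below each of them.

```r
# Made-up values, already sorted to make the result easy to follow.
x <- c(3, 5, 6, 8, 9, 11, 14, 20)

# Quartiles with the default method of quantile().
q <- quantile(x, probs = c(0.25, 0.50, 0.75))

# Share of observations that are at or below each quartile.
shares <- sapply(q, function(cut) mean(x <= cut))

q
shares
```

In very small or heavily tied samples the shares will not always land exactly on 25%, 50% and 75%, but with thousands of observations they are close.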
###another way to calculate the desired values from descriptive statistics is to use the individual functions below. I will calculate them only for the borrower's revolving balance variable.
min(data$revol.balN)
## [1] 0
max(data$revol.balN)
## [1] 1207359
mean(data$revol.balN)
## [1] 16913.96
median(data$revol.balN)
## [1] 8596
sd(data$revol.balN)
## [1] 33756.19
var(data$revol.balN)
## [1] 1139480333
range(data$revol.balN)
## [1] 0 1207359
sum(data$revol.balN)
## [1] 162001946
quantile(data$revol.balN)
## 0% 25% 50% 75% 100%
## 0.0 3187.0 8596.0 18249.5 1207359.0
quantile(data$revol.balN, probs = 0.10)
## 10%
## 710.7
min: the smallest revolving balance is $0 (i.e. nothing is left unpaid)
max: the largest revolving balance is $1,207,359
mean: the average revolving balance of a borrower in the sample is $16,913.96.
median: half of the borrowers had a revolving balance of $8,596 or less at the end of the credit card billing cycle.
sd: the standard deviation is $33,756.19, which is roughly the typical distance of the values from the mean.
var: the variance, 1,139,480,333, is the square of the standard deviation.
range: range() returns the smallest and largest values, $0 and $1,207,359, so the spread between borrowers is $1,207,359.
sum: the total revolving balance of all borrowers is $162,001,946
quantile: A quarter of borrowers had a revolving balance of $3,187 or less. Half of the borrowers had $8,596 or less. And 75% of borrowers had $18,249.5 or less.
quantile, 10%: This means that 10% of borrowers had a revolving balance of $710.7 or less.
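The link between these measures can be shown on a tiny made-up vector (`x` is an assumption for illustration): the variance is the square of the standard deviation, and the spread is the difference between the two values returned by `range()`.

```r
# Made-up balances in dollars.
x <- c(0, 500, 1200, 3000, 8000)

variance <- var(x)          # sample variance (divides by n - 1)
std_dev  <- sd(x)           # sample standard deviation
spread   <- diff(range(x))  # largest value minus smallest value

all.equal(std_dev^2, variance)
spread
```

Note that `range()` itself returns two numbers (minimum and maximum); `diff()` turns them into a single spread.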
#install.packages("pastecs")
library(pastecs)
round(stat.desc(data[ , c(-10, -11, -12) ]), 2)
## int.rate installment log.annual.inc dti days.with.cr.line
## nbr.val 9578.00 9578.00 9578.00 9578.00 9578.00
## nbr.null 0.00 0.00 0.00 89.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 0.06 15.67 7.55 0.00 178.96
## max 0.22 940.14 14.53 29.96 17639.96
## range 0.16 924.47 6.98 29.96 17461.00
## sum 1175.73 3056238.40 104710.00 120746.77 43683027.92
## median 0.12 268.95 10.93 12.66 4139.96
## mean 0.12 319.09 10.93 12.61 4560.77
## SE.mean 0.00 2.12 0.01 0.07 25.51
## CI.mean.0.95 0.00 4.15 0.01 0.14 50.01
## var 0.00 42878.52 0.38 47.39 6234661.48
## std.dev 0.03 207.07 0.61 6.88 2496.93
## coef.var 0.22 0.65 0.06 0.55 0.55
## revol.util inq.last.6mths delinq.2yrs pub.rec revol.balN
## nbr.val 9578.00 9578.00 9578.00 9578.00 9.578000e+03
## nbr.null 297.00 3637.00 8458.00 9019.00 3.210000e+02
## nbr.na 0.00 0.00 0.00 0.00 0.000000e+00
## min 0.00 0.00 0.00 0.00 0.000000e+00
## max 119.00 33.00 13.00 5.00 1.207359e+06
## range 119.00 33.00 13.00 5.00 1.207359e+06
## sum 448243.08 15109.00 1568.00 595.00 1.620019e+08
## median 46.30 1.00 0.00 0.00 8.596000e+03
## mean 46.80 1.58 0.16 0.06 1.691396e+04
## SE.mean 0.30 0.02 0.01 0.00 3.449200e+02
## CI.mean.0.95 0.58 0.04 0.01 0.01 6.761100e+02
## var 841.84 4.84 0.30 0.07 1.139480e+09
## std.dev 29.01 2.20 0.55 0.26 3.375619e+04
## coef.var 0.62 1.39 3.34 4.22 2.000000e+00
###The stat.desc function also calculates basic statistics. round is needed to round the value to two decimal places so that the results look better. I also removed categorical variables, because their statistical description does not make any sense
Here, too, I will explain only one variable - installment.
nbr.val 9578.00 is the number of non-missing values; in this case all 9578 observations have a value
nbr.null 0.00 and nbr.na 0.00 mean that the installment variable contains neither zeros nor missing values
min 15.67 - that is, the minimum monthly payment is $15.67
max 940.14 - this means that the borrower’s maximum monthly payment is $940.14
range 924.47 means that the difference between the largest and smallest monthly payments of the sample is $924.47
sum 3056238.40 - that is, taken together, all borrowers in the sample owe $3,056,238.40 in installments each month.
median 268.95 means that half of borrowers pay $268.95 or less each month.
mean 319.09 means that on average the borrower pays $319.09 per month
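The remaining rows of the table for installment (SE.mean, CI.mean.0.95, coef.var) can be reproduced by hand from n, the mean and the standard deviation shown above. This is a sketch under the assumption that CI.mean.0.95 is the half-width of the 95% confidence interval of the mean; the variable names are my own.

```r
# Values for installment taken from the stat.desc table above.
n <- 9578     # nbr.val
s <- 207.07   # std.dev
m <- 319.09   # mean

se_mean       <- s / sqrt(n)                      # standard error of the mean
ci_half_width <- qt(0.975, df = n - 1) * se_mean  # half-width of the 95% CI
coef_var      <- s / m                            # coefficient of variation

round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half_width, coef.var = coef_var), 2)
```

Rounded to two decimals these agree with the 2.12, 4.15 and 0.65 reported in the table, so the mean monthly installment in the population is estimated at roughly $319.09 plus or minus $4.15.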
library(psych)
describe(data [ , c(-10, -11, -12) ]) ###another function to calculate statistics, without categorical variables as well
## vars n mean sd median trimmed mad min
## int.rate 1 9578 0.12 0.03 0.12 0.12 0.03 0.06
## installment 2 9578 319.09 207.07 268.95 295.64 184.88 15.67
## log.annual.inc 3 9578 10.93 0.61 10.93 10.93 0.55 7.55
## dti 4 9578 12.61 6.88 12.66 12.59 7.98 0.00
## days.with.cr.line 5 9578 4560.77 2496.93 4139.96 4303.64 2135.06 178.96
## revol.util 6 9578 46.80 29.01 46.30 46.50 35.88 0.00
## inq.last.6mths 7 9578 1.58 2.20 1.00 1.16 1.48 0.00
## delinq.2yrs 8 9578 0.16 0.55 0.00 0.02 0.00 0.00
## pub.rec 9 9578 0.06 0.26 0.00 0.00 0.00 0.00
## revol.balN 10 9578 16913.96 33756.19 8596.00 10809.19 9619.11 0.00
## max range skew kurtosis se
## int.rate 0.22 0.16 0.17 -0.21 0.00
## installment 940.14 924.47 0.91 0.14 2.12
## log.annual.inc 14.53 6.98 0.03 1.61 0.01
## dti 29.96 29.96 0.02 -0.90 0.07
## days.with.cr.line 17639.96 17461.00 1.16 1.94 25.51
## revol.util 119.00 119.00 0.06 -1.12 0.30
## inq.last.6mths 33.00 33.00 3.58 26.27 0.02
## delinq.2yrs 13.00 13.00 6.06 71.38 0.01
## pub.rec 5.00 5.00 5.12 38.75 0.00
## revol.balN 1207359.00 1207359.00 11.16 259.46 344.92
here I propose to analyze the pub.rec variable, i.e. The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
This is variable (var) number 9 in the output table.
n is the number of observations, which is 9578 for every variable
mean is the arithmetic mean: on average, a borrower has 0.06 derogatory public records.
sd stands for standard deviation: the values are typically about 0.26 units (number of records) away from the mean
median of 0.00 means that at least half of the borrowers have 0 public records
trimmed is the trimmed mean: the mean calculated after a fixed share of the smallest and largest values has been removed, which reduces the influence of extreme values (outliers). Here the trimmed mean is 0.00, lower than the ordinary mean of 0.06, which shows that the mean is pulled up by a small number of borrowers with several records.
the minimum and maximum values show the smallest and largest number of public records a borrower has, in this case 0 and 5, respectively. The range shows the spread of the data: max - min = 5 - 0 = 5 units.
skewness (skew) measures the asymmetry of the distribution around the mean. In this case the skew is positive (5.12), which means there is a longer "tail" on the right-hand side.
kurtosis measures the tailedness or peakedness of a distribution. In our case the value is large and positive (38.75), which means heavier tails and a sharper peak compared to a normal distribution
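To see what the trimmed mean, skew and kurtosis react to, here is a base R sketch on a made-up count variable that is mostly zeros (`x` is an assumption for illustration). The skew and kurtosis formulas use one common definition, moments divided by powers of the sample standard deviation; I believe this matches the default of `describe()`, but treat that as an assumption.

```r
# Made-up counts: mostly zeros with one large value, similar in shape to pub.rec.
x <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 5)

plain_mean   <- mean(x)
trimmed_mean <- mean(x, trim = 0.1)  # drops the lowest and highest 10% first

centred <- x - mean(x)
skew            <- mean(centred^3) / sd(x)^3      # positive: long right tail
excess_kurtosis <- mean(centred^4) / sd(x)^4 - 3  # 0 for a normal distribution

c(mean = plain_mean, trimmed = trimmed_mean, skew = skew, kurtosis = excess_kurtosis)
```

The single value of 5 raises the ordinary mean to 0.7, while the trimmed mean stays at 0.25, and the skew is positive because the extreme value sits on the right.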
describeBy(data$dti, group = data$purposeF) ###the original purpose column was deleted above, so the factor purposeF is used for grouping
##
## Descriptive statistics by group
## group: credit_card
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1262 14.1 6.47 14.38 14.21 7.46 0 29.95 29.95 -0.11 -0.8 0.18
## ------------------------------------------------------------
## group: debt_consolidation
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 3957 14.08 6.43 14.24 14.18 7.35 0 29.96 29.96 -0.09 -0.77 0.1
## ------------------------------------------------------------
## group: educational
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 343 11.34 6.94 11.42 11.15 8.29 0 29.74 29.74 0.19 -0.87 0.37
## ------------------------------------------------------------
## group: major_purchase
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 437 10.16 6.63 9.51 9.91 7.78 0 26.15 26.15 0.26 -0.96 0.32
## ------------------------------------------------------------
## group: small_business
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 619 10.79 6.93 10.39 10.53 8.09 0 29.21 29.21 0.26 -0.92 0.28
## ------------------------------------------------------------
## group: all_other
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2331 11.08 7.1 10.56 10.84 8.51 0 29.9 29.9 0.23 -0.92 0.15
## ------------------------------------------------------------
## group: home_improvement
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 629 10.2 6.78 9.66 9.82 8.04 0 28.17 28.17 0.38 -0.75 0.27
This function calculated the statistics for the variable “dti” separately for each category of the categorical variable “purpose”.
These results are easier to compare in the knitted document, although the output has the same format either way.
So, we can see that, for example, the most popular purpose is debt_consolidation, with 3957 observations in this category, followed by all_other and credit_card with 2331 and 1262 observations respectively. The least popular purpose, on the other hand, is educational, with only 343 observations.
The average debt-to-income ratio also varies by category. For credit_card, for example, the mean ratio is 14.1, the highest of all categories. Since the values range from 0 to about 30, they are most plausibly percentages, meaning that on average the debt of those who borrow for a credit card equals about 14.1% of their annual income, rather than 14.1 times their income. For those who borrow for educational purposes the mean is 11.34. The ratio is lowest for major_purchase and home_improvement, at 10.16 and 10.20, respectively.
The two highest median values are 14.38 and 14.24 for credit_card and debt_consolidation, respectively, which means that half of those who borrow for credit cards have a debt-to-income ratio of 14.38 or less, and half of those who borrow for debt consolidation have a ratio of 14.24 or less. By “less” I mean less debt relative to income.
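The same kind of group-wise comparison can be done in base R without extra packages. A minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration):

```r
# Made-up debt-to-income values for three loan purposes.
toy <- data.frame(
  dti     = c(14.2, 13.9, 11.0, 10.5, 9.8, 12.1),
  purpose = factor(c("credit_card", "credit_card",
                     "educational", "educational",
                     "major_purchase", "major_purchase"))
)

group_n      <- table(toy$purpose)                                # observations per group
group_mean   <- tapply(toy$dti, toy$purpose, mean)                # mean per group
group_median <- aggregate(dti ~ purpose, data = toy, FUN = median)  # median per group

group_n
group_mean
group_median
```

`tapply()` returns a named vector, which is handy for a quick look, while `aggregate()` returns a data frame that is easier to reuse in later steps.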
describeBy(data$revol.balN, group = data$credit.policy.F)
##
## Descriptive statistics by group
## group: Fulfill
## vars n mean sd median trimmed mad min max range skew
## X1 1 7710 13798.4 16878.56 8707.5 10514.1 9423.41 0 149527 149527 2.97
## kurtosis se
## X1 12.54 192.22
## ------------------------------------------------------------
## group: Otherwise
## vars n mean sd median trimmed mad min max range
## X1 1 1868 29773.15 66807.57 8039.5 14032.46 10343.36 0 1207359 1207359
## skew kurtosis se
## X1 6.64 79.16 1545.74
So here is another example of how to calculate statistics for each category separately. Here the statistics are calculated for the quantitative variable “revol.balN”, the borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle). It is grouped by whether or not the customer meets the credit underwriting criteria of the platform.
Let’s analyze here the values that I did not analyze in the previous example.
The minimum and maximum revolving balance for those who meet the criteria are 0 USD and 149,527 USD, respectively, while for those who do not meet the criteria they are 0 USD and 1,207,359 USD.
Both skewness values are positive: 2.97 for those who meet the criteria and 6.64 for those who do not. Positive skew means that the right “tail” is longer.
Kurtosis is positive in both cases, so both distributions have heavier tails and a sharper peak than a normal distribution. For those who do not meet the criteria the value is larger (79.16 vs 12.54), that is, extreme values are even more pronounced in that group.
Independent Sample Test
This test compares units that belong to two different populations. Each unit is measured only once and belongs to only one population, so the samples do not overlap.
.
The independent t-test is the natural candidate here, because there are only two categories and they are independent, that is, the populations do not intersect with each other.
.
Research question: is there a difference in the mean revolving balance between borrowers who meet the credit criteria and those who do not?
.
For this, we use a sample of 9578 units of observation, which includes users (borrowers) of the platform LendingClub.com (a platform where private borrowers and lenders meet). One unit of observation is one borrower.
.
Assumptions that must hold:
the analysed variable must be numeric.
the distribution of the variable is normal in both populations.
the data must come from two independent populations.
the variable has the same variance in both populations. Otherwise, the Welch correction is applied.
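For reference, this is roughly how the t-test with the Welch correction would look in base R if the normality assumption held. It is a sketch on simulated data (the vectors `balance` and `group` are made up), not the analysis of the loan data:

```r
set.seed(1)  # makes the simulated numbers reproducible

# Two simulated groups with different means and different variances.
balance <- c(rnorm(50, mean = 100, sd = 10),
             rnorm(40, mean = 110, sd = 25))
group   <- factor(rep(c("A", "B"), times = c(50, 40)))

# var.equal = FALSE requests the Welch correction for unequal variances.
welch <- t.test(balance ~ group, var.equal = FALSE)

welch$method
welch$p.value
```

With `var.equal = TRUE` the same call would run the classical t-test that assumes equal variances in both populations.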
.
To answer our research question we need the following two variables:
revol.balN: borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle, in $), numerical
credit.policy.F: whether the customer meets the credit underwriting criteria or not, categorical, was turned into a factor before
library(psych)
describeBy(data$revol.balN, data$credit.policy.F) ###function that helps to make descriptive statistics for each factor variable group separately
##
## Descriptive statistics by group
## group: Fulfill
## vars n mean sd median trimmed mad min max range skew
## X1 1 7710 13798.4 16878.56 8707.5 10514.1 9423.41 0 149527 149527 2.97
## kurtosis se
## X1 12.54 192.22
## ------------------------------------------------------------
## group: Otherwise
## vars n mean sd median trimmed mad min max range
## X1 1 1868 29773.15 66807.57 8039.5 14032.46 10343.36 0 1207359 1207359
## skew kurtosis se
## X1 6.64 79.16 1545.74
Here are the basic descriptive statistics for the revolving balance by category. As we can see, the mean revolving balance for those who meet the criteria of the credit policy is 13798.4, which is less than half of the mean for those who do not meet the criteria (29773.15). The medians, in contrast, are close to each other (8707.5 vs 8039.5).
Informally, this suggests that there is a difference in means and that it is sizeable.
However, since we are evaluating a sample rather than the entire population, we must perform formal tests before drawing any real conclusions.
.
The observation units in our sample belong to two different populations. That is, a borrower either meets the criteria or does not, and cannot be in both categories at the same time. Therefore, each unit in the sample was measured only once. Since these are two different populations, we choose an independent t-test.
.
Let's recall our assumptions once again and test whether they hold.
Numeric variable. Yes, revolving balance is numeric, expressed in $.
Distribution of the variable is normal in both populations. To be tested now.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
Fulfill <- ggplot(data[data$credit.policy.F == "Fulfill", ], aes(x = revol.balN)) + ###add dataframe, variable for x-axis and factor to group by
theme_linedraw() + ###change theme
geom_histogram() + ###type of the graph
labs(title = "Fulfill Criteria", ###add title
x = "Revolving Balance", ###name x-axis
y = "Frequency") ###name y-axis
Not_Fulfill <- ggplot(data[data$credit.policy.F == "Otherwise", ], aes(x = revol.balN)) + ###add dataframe, variable for x-axis and factor to group by
theme_linedraw() + ###change theme
geom_histogram() + ###type of the graph
labs(title = "Do Not Fulfill Criteria", ###add title
x = "Revolving Balance", ###name x-axis
y = "Frequency") ###name y-axis
library(ggpubr)
ggarrange(Fulfill, Not_Fulfill, ###group two graphs into one picture
ncol = 2, nrow = 1) ###2 columns, 1 row, means place them nearby horizontally
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Our sample is really large, so it is possible to draw conclusions
from the graphs alone. As we can see, the distribution is not normal
in either case. Both histograms are strongly right-skewed, with the
smallest values having the highest frequency.
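The `stat_bin()` message above suggests choosing a bin width explicitly. One common rule of thumb is the Freedman-Diaconis rule; the sketch below applies it to simulated right-skewed data (the vector `x` is made up, and using this particular rule is my suggestion, not part of the original analysis). The resulting value could then be passed to `geom_histogram(binwidth = ...)`.

```r
set.seed(1)  # makes the simulated numbers reproducible

# Simulated right-skewed data standing in for a balance variable.
x <- rlnorm(1000, meanlog = 9, sdlog = 1)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3).
fd_binwidth <- 2 * IQR(x) / length(x)^(1 / 3)

# Number of bins this width implies over the observed range.
n_bins <- ceiling(diff(range(x)) / fd_binwidth)

# A mean well above the median is another quick sign of right skew.
c(binwidth = fd_binwidth, bins = n_bins, mean = mean(x), median = median(x))
```

With very long right tails this rule can produce a large number of bins, so plotting the variable on a log scale may be a more readable alternative.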
.
Although the Shapiro-Wilk test is not strictly needed here, let me run it just for practice.
.
The Shapiro-Wilk test in R can only be performed on samples of 3 to 5000 observations. My sample is almost twice as large, so it has to be reduced. To keep the selection of observation units unbiased, we use the following R command to select 4500 units at random.
small_sample <- data[sample(nrow(data), 4500), ]
summary(small_sample)
## int.rate installment log.annual.inc dti
## Min. :0.0600 Min. : 15.69 Min. : 7.60 Min. : 0.000
## 1st Qu.:0.1000 1st Qu.:163.14 1st Qu.:10.57 1st Qu.: 7.185
## Median :0.1200 Median :268.49 Median :10.93 Median :12.620
## Mean :0.1226 Mean :316.88 Mean :10.94 Mean :12.572
## 3rd Qu.:0.1400 3rd Qu.:421.89 3rd Qu.:11.29 3rd Qu.:17.850
## Max. :0.2200 Max. :940.14 Max. :14.53 Max. :29.740
##
## days.with.cr.line revol.util inq.last.6mths delinq.2yrs
## Min. : 180 Min. : 0.00 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 2820 1st Qu.: 22.68 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 4140 Median : 46.30 Median : 1.00 Median : 0.0000
## Mean : 4544 Mean : 46.83 Mean : 1.58 Mean : 0.1649
## 3rd Qu.: 5700 3rd Qu.: 70.70 3rd Qu.: 2.00 3rd Qu.: 0.0000
## Max. :17616 Max. :106.50 Max. :28.00 Max. :11.0000
##
## pub.rec credit.policy.F purposeF not.fully.paidF
## Min. :0.00000 Fulfill :3616 credit_card : 584 Not Paid: 731
## 1st Qu.:0.00000 Otherwise: 884 debt_consolidation:1857 Paid :3769
## Median :0.00000 educational : 159
## Mean :0.05844 major_purchase : 218
## 3rd Qu.:0.00000 small_business : 271
## Max. :3.00000 all_other :1113
## home_improvement : 298
## revol.balN
## Min. : 0
## 1st Qu.: 3124
## Median : 8670
## Mean : 16873
## 3rd Qu.: 18550
## Max. :952013
##
summary(data)
## int.rate installment log.annual.inc dti
## Min. :0.0600 Min. : 15.67 Min. : 7.55 Min. : 0.000
## 1st Qu.:0.1000 1st Qu.:163.77 1st Qu.:10.56 1st Qu.: 7.213
## Median :0.1200 Median :268.95 Median :10.93 Median :12.665
## Mean :0.1228 Mean :319.09 Mean :10.93 Mean :12.607
## 3rd Qu.:0.1400 3rd Qu.:432.76 3rd Qu.:11.29 3rd Qu.:17.950
## Max. :0.2200 Max. :940.14 Max. :14.53 Max. :29.960
##
## days.with.cr.line revol.util inq.last.6mths delinq.2yrs
## Min. : 179 Min. : 0.0 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 2820 1st Qu.: 22.6 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 4140 Median : 46.3 Median : 1.000 Median : 0.0000
## Mean : 4561 Mean : 46.8 Mean : 1.577 Mean : 0.1637
## 3rd Qu.: 5730 3rd Qu.: 70.9 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :17640 Max. :119.0 Max. :33.000 Max. :13.0000
##
## pub.rec credit.policy.F purposeF not.fully.paidF
## Min. :0.00000 Fulfill :7710 credit_card :1262 Not Paid:1533
## 1st Qu.:0.00000 Otherwise:1868 debt_consolidation:3957 Paid :8045
## Median :0.00000 educational : 343
## Mean :0.06212 major_purchase : 437
## 3rd Qu.:0.00000 small_business : 619
## Max. :5.00000 all_other :2331
## home_improvement : 629
## revol.balN
## Min. : 0
## 1st Qu.: 3187
## Median : 8596
## Mean : 16914
## 3rd Qu.: 18250
## Max. :1207359
##
With the summary command I compute basic descriptive statistics and compare the values of the main sample and the reduced sample. The rows of the reduced sample were selected randomly, so the sample should not be systematically biased. There are small deviations in the reduced sample (for example, the mean of revol.balN is 16873 instead of 16914), but they are small and will not harm further analysis
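The comparison of the two summary tables can also be done numerically. The sketch below is only an illustration: it uses simulated data (the column name x and the log-normal distribution are assumptions, not the loan data) and a fixed seed, so that the random draw is reproducible.

```r
# Simulated stand-in for one numeric column; NOT the loan data
set.seed(123)                                   # fixed seed makes the draw reproducible
full <- data.frame(x = rlnorm(9578, meanlog = 9, sdlog = 1))

# Draw 4500 rows without replacement, as in the analysis
small <- full[sample(nrow(full), 4500), , drop = FALSE]

# Relative difference between subsample and full-data statistics
rel_diff <- c(mean   = mean(small$x) / mean(full$x) - 1,
              median = median(small$x) / median(full$x) - 1)
round(rel_diff, 3)
```

Values close to zero indicate that the reduced sample reproduces the full data well.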
.
Now let’s do the Shapiro-Wilk test for the reduced sample (the test accepts at most 5000 observations, so the full data set of 9578 rows cannot be used)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
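small_sample <- data[sample(nrow(data), 4500), ] ###randomly select 4500 rows; the draw changes on every run unless set.seed() is called first
small_sample %>%
group_by(credit.policy.F) %>% ###test each group separately
shapiro_test(revol.balN) ###tested variable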
## # A tibble: 2 × 4
## credit.policy.F variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Fulfill revol.balN 0.697 1.08e-62
## 2 Otherwise revol.balN 0.418 1.71e-46
Let’s set the hypotheses for the Shapiro-Wilk test
H0: data is normally distributed
H1: data is not normally distributed
As we conduct the test for two groups, we have to evaluate the hypotheses for each group separately
for the group “Fulfill”, the p-value is 1.08e-62, so we reject H0 at p-value < 0.001; hence the data in this group is not normally distributed
for the group “Otherwise”, the p-value is 1.71e-46, so we reject H0 at p-value < 0.001; hence the data in this group is not normally distributed
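The same decision rule can be written down in base R. This is a minimal sketch on simulated right-skewed data; the group sizes, the distribution and the column name balance are assumptions made for illustration only.

```r
# Simulated right-skewed balances for two groups; values are illustrative only
set.seed(1)
sim <- data.frame(
  group   = factor(rep(c("Fulfill", "Otherwise"), times = c(3000, 800))),
  balance = rlnorm(3800, meanlog = 9, sdlog = 1.2)
)

# shapiro.test() accepts between 3 and 5000 values, hence the reduced sample
p_values <- tapply(sim$balance, sim$group,
                   function(v) shapiro.test(v)$p.value)

# Decision rule: reject H0 (normality) when p < alpha
alpha <- 0.001
p_values < alpha
```

A TRUE for a group means that H0 of normality is rejected for that group.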
.
This is the second confirmation that the variable is not normally distributed. Because the normality assumption is violated, we have to conduct a non-parametric test, namely the Wilcoxon rank sum test.
.
Let’s set our hypotheses for the Wilcoxon rank sum test:
H0: locations of distributions of the borrower’s revolving balance are the same for those who fulfill credit policy and for those who don’t
H1: locations of distributions are not the same
wilcox.test(data$revol.balN ~ data$credit.policy.F, ###tested variable ~ factor variable
paired = FALSE, ###because the samples are independent, not paired
correct = FALSE, ###no continuity correction
exact = FALSE, ###use the normal approximation instead of the exact p-value
alternative = "two.sided") ###because we check whether the locations are the same or not. If we'd like to know whether one of them is bigger or smaller, we have to write "greater" or "less"
##
## Wilcoxon rank sum test
##
## data: data$revol.balN by data$credit.policy.F
## W = 7104372, p-value = 0.3668
## alternative hypothesis: true location shift is not equal to 0
The p-value is 0.3668, which is greater than 0.05, so we cannot reject H0.
Therefore, we cannot rule out that the locations of distributions of the borrower’s revolving balance are the same for those who fulfill the credit policy and for those who don’t
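To see where the statistic W comes from, here is a small sketch on simulated data (the vectors x and y are hypothetical and not taken from the loan data). It assumes the usual definition: the sum of the ranks of the first group minus n1(n1 + 1)/2.

```r
# Two simulated independent groups; values are illustrative only
set.seed(7)
x <- rexp(30, rate = 1 / 10000)   # group 1
y <- rexp(25, rate = 1 / 12000)   # group 2

# Rank all observations together, then sum the ranks of group 1
ranks    <- rank(c(x, y))
n1       <- length(x)
W_manual <- sum(ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2

# The same statistic as reported by wilcox.test()
W_r <- unname(wilcox.test(x, y, exact = FALSE, correct = FALSE)$statistic)
all.equal(W_manual, W_r)
```

W counts the pairs in which an observation from the first group is larger than an observation from the second group.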
.
In order to estimate how strongly compliance or non-compliance with the criteria affects the revolving balance, let’s calculate the effect size
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize(wilcox.test(data$revol.balN ~ data$credit.policy.F,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## -0.01 | [-0.04, 0.02]
In absolute terms the effect size is equal to 0.01
If we interpret it manually, 0.01 < 0.05, so the effect size is tiny.
One can also use R to determine how large the effect size is. For this we will use the following function
interpret_rank_biserial(0.01)
## [1] "tiny"
## (Rules: funder2019)
Therefore, it is confirmed that the effect size is tiny
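The reported effect size can be checked by hand from the numbers in the outputs above. The sketch assumes that W is the statistic of the first factor level ("Fulfill") and uses the formula r = 2W / (n1 * n2) - 1 for the rank-biserial correlation.

```r
# Values taken from the outputs above
W  <- 7104372   # Wilcoxon rank sum statistic
n1 <- 7710      # borrowers in group "Fulfill"
n2 <- 1868      # borrowers in group "Otherwise"

# Rank-biserial correlation: share of favourable pairs minus share of unfavourable pairs
r_rb <- 2 * W / (n1 * n2) - 1
round(r_rb, 2)   # about -0.01
```

The result agrees with the value printed by effectsize(), and its absolute value is below the 0.05 threshold for a tiny effect.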
Conclusion: we cannot reject H0 (p-value = 0.3668 > 0.05), so we cannot say that the locations of distributions of the borrower’s revolving balance differ between those who fulfill the credit policy and those who don’t. Compliance with the credit policy has only a tiny effect (rank biserial r = -0.01) on the amount of the borrower’s revolving balance.
.
.
.
.
ANOVA test
Here we will compare the arithmetic means of independent samples (that is, the observation units from different populations do not intersect). This is an extension of the independent samples t-test
.
ANOVA is used because the categories are independent, that is, the populations do not overlap with each other, and there are more than two such categories
.
Research question: is there a difference in the amount of installment between categories of borrowers with different purposes for borrowing?
.
Variables necessary to investigate the research question:
installment: The monthly installments owed by the borrower if the loan is funded in $, numerical.
purpose: The purpose of the loan, categorical, nominal.
.
as the samples do not overlap, that is, one borrower has only one purpose of the loan, and there are more than two categories, we will conduct ANOVA test
.
Let’s set the assumptions:
The analyzed variable is numeric. The condition is met because the installment is measured in dollars and is numeric
The variable is normally distributed in the population within each group. If this is violated, we use a non-parametric test
Homoscedasticity: the variance of the analyzed variable is the same within all groups. If this is violated, the test stays parametric but needs the Welch correction
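The assumptions above lead to a simple decision rule, which can be sketched as a small helper. The function choose_test() is hypothetical and written only for illustration; it returns the name of the base R call that matches the outcome of the two checks.

```r
# Hypothetical helper: picks the test from the two assumption checks
choose_test <- function(normal_ok, equal_var_ok) {
  if (!normal_ok) {
    "kruskal.test"                       # non-parametric alternative
  } else if (!equal_var_ok) {
    "oneway.test(var.equal = FALSE)"     # ANOVA with Welch correction
  } else {
    "oneway.test(var.equal = TRUE)"      # classical one-way ANOVA
  }
}

choose_test(normal_ok = FALSE, equal_var_ok = FALSE)
choose_test(normal_ok = TRUE,  equal_var_ok = FALSE)
```

Normality is checked first, because a violation of it already rules out the parametric test, whatever the result for the variances.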
library(psych)
describeBy(x = data$installment, group = data$purposeF)
##
## Descriptive statistics by group
## group: credit_card
## vars n mean sd median trimmed mad min max range skew
## X1 1 1262 319.5 198.23 266.67 296.87 169.88 16.73 922.42 905.69 0.95
## kurtosis se
## X1 0.27 5.58
## ------------------------------------------------------------
## group: debt_consolidation
## vars n mean sd median trimmed mad min max range skew
## X1 1 3957 358.98 198.31 325.08 342.16 203.21 23.21 940.14 916.93 0.7
## kurtosis se
## X1 -0.17 3.15
## ------------------------------------------------------------
## group: educational
## vars n mean sd median trimmed mad min max range skew
## X1 1 343 217.55 168.51 169.62 190.44 125.74 15.67 861.88 846.21 1.59
## kurtosis se
## X1 2.65 9.1
## ------------------------------------------------------------
## group: major_purchase
## vars n mean sd median trimmed mad min max range skew
## X1 1 437 243.48 179.32 198.78 212.63 119.11 30.94 898.55 867.61 1.69
## kurtosis se
## X1 2.74 8.58
## ------------------------------------------------------------
## group: small_business
## vars n mean sd median trimmed mad min max range skew
## X1 1 619 433.83 248.59 394.36 422.7 276.49 16.25 926.83 910.58 0.39
## kurtosis se
## X1 -0.99 9.99
## ------------------------------------------------------------
## group: all_other
## vars n mean sd median trimmed mad min max range skew
## X1 1 2331 244.94 184.27 190.63 215.95 139.17 15.69 916.95 901.26 1.44
## kurtosis se
## X1 1.81 3.82
## ------------------------------------------------------------
## group: home_improvement
## vars n mean sd median trimmed mad min max range skew
## X1 1 629 337.07 222.11 282.4 313.57 195.67 28.47 902.06 873.59 0.83
## kurtosis se
## X1 -0.24 8.86
Here we can see 7 different categories grouped by the categorical variable purpose, which was turned into a factor earlier
The ranges are similar across the groups, but the means, medians and standard deviations differ noticeably
The largest maximum value is for debt consolidation (940.14 USD), while the smallest maximum value is for education (861.88 USD)
The smallest minimum values are observed for the categories education (15.67 USD) and all other reasons (15.69 USD), while the highest minimum value is for major purchases (30.94 USD)
Regarding the mean, the highest average installments are for small business (433.83 USD), debt consolidation (358.98 USD) and home improvement (337.07 USD), while the smallest is for education (217.55 USD)
.
.
Now we need to check homoscedasticity, i.e. whether the variances are equal. With this test we will find out whether we need a Welch correction
.
We can get a preliminary result regarding homoscedasticity by comparing the standard deviations (the square roots of the variances) in the sd column of the descriptive statistics above. They do not have to be identical, but they should be roughly the same
But as we can see, these values are very different: for example, for the education group the sd is 168.51, while for the small business group it is 248.59, so the first is about a third smaller. Smaller sd values are also characteristic of groups such as major purchases (179.32) and all other reasons (184.27), while a larger sd value is observed for home improvement (222.11)
So we can make a preliminary assumption that homoscedasticity is violated
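The preliminary comparison can be expressed as one number. The sketch below copies the sd values from the describeBy() output above and computes the ratio of the largest to the smallest one; no threshold is applied, because the formal decision is left to the Levene test.

```r
# Standard deviations copied from the describeBy() output above
sds <- c(credit_card = 198.23, debt_consolidation = 198.31, educational = 168.51,
         major_purchase = 179.32, small_business = 248.59, all_other = 184.27,
         home_improvement = 222.11)

sd_ratio  <- max(sds) / min(sds)   # largest SD relative to smallest SD
var_ratio <- sd_ratio^2            # the same comparison on the variance scale
round(c(sd_ratio = sd_ratio, var_ratio = var_ratio), 2)
```

The largest sd is roughly 1.5 times the smallest one, which on the variance scale is more than a factor of two.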
.
but for a more accurate result, let’s conduct a Levene test
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
leveneTest(data$installment, group = data$purposeF)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 6 38.588 < 2.2e-16 ***
## 9571
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: sigma^2 (1) = sigma^2 (2) = ... = sigma^2 (7), i.e. the variances of all 7 groups are equal
H1: at least one variance differs
We reject H0 at p < 0.001
As homoscedasticity is violated, we need the Welch correction
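To show what the Levene test with center = median does, here is a base R sketch on simulated data (the groups a, b, c and their spreads are assumptions for illustration). The test is a one-way ANOVA on the absolute deviations of each observation from its group median.

```r
# Simulated data with unequal spread across three groups (illustrative only)
set.seed(42)
g <- factor(rep(c("a", "b", "c"), each = 200))
y <- rnorm(600, mean = 300, sd = rep(c(50, 100, 200), each = 200))

# Absolute deviation of every observation from its own group median
abs_dev <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the deviations gives the Levene F statistic and p-value
levene   <- anova(lm(abs_dev ~ g))
levene_p <- levene[["Pr(>F)"]][1]
levene_p < 0.001
```

If the groups have the same spread, the average absolute deviation is similar in all groups and the F statistic stays small.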
.
.
There are two ways to check for normality: a boxplot and the Shapiro-Wilk test
Let’s start with a boxplot
library(ggplot2)
ggplot(data, aes(x=installment, ###add variable
fill=purposeF))+ ###group by categories
geom_boxplot() + ###type of the graph
xlab("Installment") + ###x-axis name
labs(fill="Purpose of Credit", ###add legend title
title = "Installment based on Purpose") ###add title
Judging by the boxplot, almost none of the groups looks normally distributed
We can observe outliers, which are marked with bold dots on the graph, for such groups as all other reasons, major purchases, education, debt consolidation and credit card
Also, the groups are right (positively) skewed, since the median as well as the first and third quartiles are shifted to the left. Even removing the outliers does not seem to correct the situation. Only the small business group looks more or less normally distributed
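The right skew can also be seen without the plot. As a quick check, the sketch below copies the means and medians from the describeBy() output above; a mean that lies above the median is a simple indication of positive skew.

```r
# Means and medians copied from the describeBy() output above
means   <- c(credit_card = 319.50, debt_consolidation = 358.98, educational = 217.55,
             major_purchase = 243.48, small_business = 433.83, all_other = 244.94,
             home_improvement = 337.07)
medians <- c(credit_card = 266.67, debt_consolidation = 325.08, educational = 169.62,
             major_purchase = 198.78, small_business = 394.36, all_other = 190.63,
             home_improvement = 282.40)

# Difference between mean and median per group, largest first
sort(means - medians, decreasing = TRUE)

# TRUE if the mean exceeds the median in every group
all(means > medians)
```

This agrees with the positive values in the skew column of the same output.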
.
.
in order to make sure that normality is violated, let’s conduct the Shapiro-Wilk test
library(dplyr)
library(rstatix)
data %>%
group_by(purposeF) %>%
shapiro_test(installment)
## # A tibble: 7 × 4
## purposeF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 credit_card installment 0.922 4.65e-25
## 2 debt_consolidation installment 0.951 1.77e-34
## 3 educational installment 0.849 1.09e-17
## 4 major_purchase installment 0.827 2.17e-21
## 5 small_business installment 0.943 9.91e-15
## 6 all_other installment 0.862 3.47e-41
## 7 home_improvement installment 0.916 3.33e-18
H0: the variable is normally distributed
H1: the variable is not normally distributed
For each of these groups we reject H0 at p-value < 0.001; therefore, the data are not normally distributed in any of the groups
.
.
As normality is violated, we have to go for the Kruskal-Wallis Rank Sum Test
.
.
KRUSKAL-WALLIS RANK SUM TEST
Comparison of three or more distribution locations of variables for independent samples (extension of the Wilcoxon Rank Sum Test).
In this test we are going to check whether there is any difference in the distribution locations of the variable installment between the groups of the categorical variable purpose
library(rstatix)
data %>%
group_by(purposeF) %>%
get_summary_stats(installment, type = "median_iqr")
## # A tibble: 7 × 5
## purposeF variable n median iqr
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 credit_card installment 1262 267. 255.
## 2 debt_consolidation installment 3957 325. 290.
## 3 educational installment 343 170. 179.
## 4 major_purchase installment 437 199. 172.
## 5 small_business installment 619 394. 402.
## 6 all_other installment 2331 191. 208.
## 7 home_improvement installment 629 282. 328.
In this test we are particularly interested in comparing medians
As we can see from the statistics above, for the educational purpose the median is equal to 169.62, while the median for small business is equal to 394.36. That is a huge difference: more than twice as large
In general, we can see very low medians for educational purpose or major purchases and very large medians for debt consolidation, small business or home improvement
.
Now let’s run the Kruskal-Wallis Rank Sum test
kruskal.test(installment ~ purposeF,
data = data)
##
## Kruskal-Wallis rank sum test
##
## data: installment by purposeF
## Kruskal-Wallis chi-squared = 940.39, df = 6, p-value < 2.2e-16
Let’s set our hypotheses:
H0: the distribution locations of the variable are the same in all groups
H1: at least one distribution location of the variable is different
We reject H0 at p < 0.001
At least one distribution location differs from the others
.
Now let’s check which groups exactly differ
library(rstatix)
groups_nonpar <- wilcox_test(installment ~ purposeF,
paired = FALSE, ###the samples are independent, not paired
p.adjust.method = "bonferroni", ###with more than 2 categories many pairs are tested, so it is better to adjust the p-values
data = data)
groups_nonpar
## # A tibble: 21 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 install… credi… debt_… 1262 3957 2164308 9.66e- 13 2.03e- 11 ****
## 2 install… credi… educa… 1262 343 291454 6.43e- 23 1.35e- 21 ****
## 3 install… credi… major… 1262 437 347766 3.71e- 16 7.79e- 15 ****
## 4 install… credi… small… 1262 619 285308 1.88e- 21 3.95e- 20 ****
## 5 install… credi… all_o… 1262 2331 1848241 4.95e- 37 1.04e- 35 ****
## 6 install… credi… home_… 1262 629 388231 4.38e- 1 1 e+ 0 ns
## 7 install… debt_… educa… 3957 343 991976 8.28e- 46 1.74e- 44 ****
## 8 install… debt_… major… 3957 437 1203204. 2.89e- 41 6.07e- 40 ****
## 9 install… debt_… small… 3957 619 1026402. 8.73e- 11 1.83e- 9 ****
## 10 install… debt_… all_o… 3957 2331 6355142. 9.79e-139 2.06e-137 ****
## # ℹ 11 more rows
The assumptions are the same as for the non-parametric version of the independent samples test, the Wilcoxon Rank Sum test
.
H0: locations of distributions of the variable installment are the same for both groups in a pair
H1: locations are not the same
Remark: here we compare the locations of distributions for each pair of groups separately
.
As we can see, the adjusted p-value is less than alpha (0.05) in most cases. That is, in most cases we reject H0 and accept H1. Only in a few exceptions we cannot reject H0
In the following cases we cannot reject H0 (p-value > 0.05):
credit_card - home_improvement
educational - major_purchase
educational - all_other
major_purchase - all_other
For these four pairs of categories we cannot conclude that the locations of the distributions of the installment differ
We reject H0 at p < 0.001 in all other cases
This means that in all other cases the locations of the distributions of the installment are not the same for the two categories in the pair
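The Bonferroni adjustment in the table above can be reproduced by hand. With 7 categories there are 21 pairs, and each raw p-value is multiplied by 21 and capped at 1. The sketch uses two raw p-values copied from the first rows of the table; the names of the vector elements are chosen only for readability.

```r
k       <- 7               # number of purpose categories
n_pairs <- choose(k, 2)    # 21 pairwise comparisons

# Two raw p-values copied from the first rows of the table above
p_raw <- c(credit_card_vs_debt_consolidation = 9.66e-13,
           credit_card_vs_home_improvement   = 4.38e-1)

# Bonferroni: multiply by the number of comparisons and cap at 1
p_adj <- p.adjust(p_raw, method = "bonferroni", n = n_pairs)
signif(p_adj, 3)
```

The results, 2.03e-11 and 1, match the p.adj column for rows 1 and 6.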
.
.
Now let’s calculate what effect the purpose of borrowing has on the size of the installment. For this, we will calculate the following effect size
kruskal_effsize(installment ~ purposeF,
data = data)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 installment 9578 0.0976 eta2[H] moderate
As we can see, the effect size is 0.0976, which is classified as a moderate effect size
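The effect size can be verified from the Kruskal-Wallis output above. The sketch assumes the formula eta2[H] = (H - k + 1) / (n - k), where H is the test statistic, k the number of groups and n the total number of observations.

```r
H <- 940.39   # Kruskal-Wallis chi-squared from the test above
k <- 7        # number of groups
n <- 9578     # total number of observations

# Eta-squared based on the H statistic
eta2_H <- (H - k + 1) / (n - k)
round(eta2_H, 4)   # 0.0976
```

The value is the same as the one returned by kruskal_effsize().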
.
.
Conclusion: as a result of the conducted test, we learned that there is a difference (H0 rejected at p < 0.001) between the distribution locations of the installment in different groups. This difference exists in almost all pairs of categories except the following: credit_card - home_improvement, educational - major_purchase, educational - all_other, major_purchase - all_other. Purpose has a moderate effect (eta2[H] = 0.0976) on the installment amount