Author: Polina Kamenchuk. Student ID: 12300685
##to upload data set into R.
data <- read.table("./loan_data.csv", ###name of the file
header = TRUE, ###first row as a header
sep = ",", ###columns are separated from each other with ","
dec = ".") ###"." is used in decimal numbers
str(data) ###to see the data structure
## 'data.frame': 9578 obs. of 14 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : chr "debt_consolidation" "credit_card" "debt_consolidation" "debt_consolidation" ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
## $ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...
head(data) ###to display the first 6 rows
## credit.policy purpose int.rate installment log.annual.inc dti
## 1 1 debt_consolidation 0.1189 829.10 11.35041 19.48
## 2 1 credit_card 0.1071 228.22 11.08214 14.29
## 3 1 debt_consolidation 0.1357 366.86 10.37349 11.63
## 4 1 debt_consolidation 0.1008 162.34 11.35041 8.10
## 5 1 credit_card 0.1426 102.92 11.29973 14.97
## 6 1 credit_card 0.0788 125.13 11.90497 16.98
## fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs
## 1 737 5639.958 28854 52.1 0 0
## 2 707 2760.000 33623 76.7 0 0
## 3 682 4710.000 3511 25.6 1 0
## 4 712 2699.958 33667 73.2 1 0
## 5 667 4066.000 4740 39.5 0 1
## 6 727 6120.042 50807 51.0 0 0
## pub.rec not.fully.paid
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
DATA DESCRIPTION
The data set was taken from the Kaggle website (https://www.kaggle.com/datasets/itssuru/loan-data).
The sample contains users (borrowers) of LendingClub.com, a platform where private borrowers and lenders meet. The sample size is 9578 observations (this can be seen from data in the Environment pane and from the str() output above).
The unit of observation is one borrower.
Description of each variable:
- credit.policy: 1 if the customer meets the credit underwriting criteria, and 0 if not; categorical.
- purpose: the purpose of the loan; categorical, nominal.
- int.rate: the interest rate of the loan as a proportion (10% is stored as 0.10); numerical.
- installment: the monthly installment owed by the borrower if the loan is funded, in $; numerical.
- log.annual.inc: the natural log of the self-reported annual income of the borrower; numerical.
- dti: the debt-to-income ratio of the borrower (amount of debt divided by annual income); numerical.
- fico: the FICO credit score of the borrower, in score points; numerical.
- days.with.cr.line: the number of days the borrower has had a credit line; numerical.
- revol.bal: the borrower's revolving balance (amount unpaid at the end of the credit card billing cycle), in $; numerical.
- revol.util: the borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available), in %; numerical.
- inq.last.6mths: the borrower’s number of inquiries by creditors in the last 6 months; numerical.
- delinq.2yrs: the number of times the borrower had been 30+ days past due on a payment in the past 2 years; numerical.
- pub.rec: the borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
###I don't really understand the FICO score, and I assume there are enough other variables to analyse, so let me remove it from the data set :)
data <- data[, -7 ]
head(data)
## credit.policy purpose int.rate installment log.annual.inc dti
## 1 1 debt_consolidation 0.1189 829.10 11.35041 19.48
## 2 1 credit_card 0.1071 228.22 11.08214 14.29
## 3 1 debt_consolidation 0.1357 366.86 10.37349 11.63
## 4 1 debt_consolidation 0.1008 162.34 11.35041 8.10
## 5 1 credit_card 0.1426 102.92 11.29973 14.97
## 6 1 credit_card 0.0788 125.13 11.90497 16.98
## days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 5639.958 28854 52.1 0 0 0
## 2 2760.000 33623 76.7 0 0 0
## 3 4710.000 3511 25.6 1 0 0
## 4 2699.958 33667 73.2 1 0 0
## 5 4066.000 4740 39.5 0 1 0
## 6 6120.042 50807 51.0 0 0 0
## not.fully.paid
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
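Removing a column by position (`-7`) silently removes the wrong column if the column order ever changes. A minimal sketch of the same step by name, using a made-up toy data frame (the object names `toy` and `toy_no_fico` are only for illustration):

```r
# Toy data frame standing in for the loan data (values are made up)
toy <- data.frame(int.rate = c(0.12, 0.11),
                  fico     = c(737, 707),
                  dti      = c(19.48, 14.29))

# Keep every column except "fico", selected by name instead of position
toy_no_fico <- toy[, setdiff(names(toy), "fico"), drop = FALSE]
print(names(toy_no_fico))
```

With the real data the equivalent call would be `data[, setdiff(names(data), "fico")]`.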
calculate_number_of_variables <- function(data) {
num_variables <- ncol(data)
cat("Number of variables in the dataset:", num_variables, "\n")
}
calculate_number_of_variables(data)
## Number of variables in the dataset: 13
There are now 13 variables, which can be seen in Environment > data and can also be calculated with the function above.
data$credit.policy.F <- factor(data$credit.policy,
levels = c(1, 0),
labels = c("Fulfill", "Oterwise")) ###to convert categorical into factor
data$purposeF <- factor(data$purpose,
levels = c("credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", "all_other", "home_improvement"),
labels = c("credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", "all_other", "home_improvement")) ###to convert categorical into factor
data$not.fully.paidF <- factor(data$not.fully.paid,
levels = c(1, 0),
labels = c("Not Paid", "Paid")) ###to convert categorical into factor
data$revol.balN <- as.numeric(data$revol.bal) ###convert integer to numeric (not to a factor): the unpaid amount is in dollars, so it should be numerical
data <- data[ , c(-1, -2, -8, -13) ] ###drop the original columns (credit.policy, purpose, revol.bal, not.fully.paid), because the converted versions replace them
head(data, 15) ### display first 15 rows
## int.rate installment log.annual.inc dti days.with.cr.line revol.util
## 1 0.1189 829.10 11.350407 19.48 5639.958 52.1
## 2 0.1071 228.22 11.082143 14.29 2760.000 76.7
## 3 0.1357 366.86 10.373491 11.63 4710.000 25.6
## 4 0.1008 162.34 11.350407 8.10 2699.958 73.2
## 5 0.1426 102.92 11.299732 14.97 4066.000 39.5
## 6 0.0788 125.13 11.904968 16.98 6120.042 51.0
## 7 0.1496 194.02 10.714418 4.00 3180.042 76.8
## 8 0.1114 131.22 11.002100 11.08 5116.000 68.6
## 9 0.1134 87.19 11.407565 17.25 3989.000 51.1
## 10 0.1221 84.12 10.203592 10.00 2730.042 23.0
## 11 0.1347 360.43 10.434116 22.09 6713.042 71.0
## 12 0.1324 253.58 11.835009 9.16 4298.000 18.2
## 13 0.0859 316.11 10.933107 15.49 6519.958 16.7
## 14 0.0714 92.82 11.512925 6.50 4384.000 4.8
## 15 0.0863 209.54 9.487972 9.73 1559.958 44.6
## inq.last.6mths delinq.2yrs pub.rec credit.policy.F purposeF
## 1 0 0 0 Fulfill debt_consolidation
## 2 0 0 0 Fulfill credit_card
## 3 1 0 0 Fulfill debt_consolidation
## 4 1 0 0 Fulfill debt_consolidation
## 5 0 1 0 Fulfill credit_card
## 6 0 0 0 Fulfill credit_card
## 7 0 0 1 Fulfill debt_consolidation
## 8 0 0 0 Fulfill all_other
## 9 1 0 0 Fulfill home_improvement
## 10 1 0 0 Fulfill debt_consolidation
## 11 2 0 1 Fulfill debt_consolidation
## 12 2 1 0 Fulfill debt_consolidation
## 13 0 0 0 Fulfill debt_consolidation
## 14 0 1 0 Fulfill small_business
## 15 0 0 0 Fulfill debt_consolidation
## not.fully.paidF revol.balN
## 1 Paid 28854
## 2 Paid 33623
## 3 Paid 3511
## 4 Paid 33667
## 5 Paid 4740
## 6 Paid 50807
## 7 Not Paid 3839
## 8 Not Paid 24220
## 9 Paid 69909
## 10 Paid 5630
## 11 Paid 13846
## 12 Paid 5122
## 13 Paid 6068
## 14 Paid 3021
## 15 Paid 6282
str(data) ###structure
## 'data.frame': 9578 obs. of 13 variables:
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
## $ credit.policy.F : Factor w/ 2 levels "Fulfill","Oterwise": 1 1 1 1 1 1 1 1 1 1 ...
## $ purposeF : Factor w/ 7 levels "credit_card",..: 2 1 2 2 1 1 2 6 7 2 ...
## $ not.fully.paidF : Factor w/ 2 levels "Not Paid","Paid": 2 2 2 2 2 2 1 1 2 2 ...
## $ revol.balN : num 28854 33623 3511 33667 4740 ...
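The integer codes in the str() output (for example `2 2 2 2 2 2 1 1 2 2` for not.fully.paidF) follow the order given in `levels`, not the original 0/1 values. A small sketch with made-up values that shows how the codes arise:

```r
# Made-up 0/1 indicator, coded like not.fully.paid
paid_raw <- c(0, 0, 1, 1, 0)

# Same conversion as above: level 1 is listed first, so it gets code 1
paid_f <- factor(paid_raw,
                 levels = c(1, 0),
                 labels = c("Not Paid", "Paid"))

levels(paid_f)      # order of the levels as defined in `levels`
as.integer(paid_f)  # internal codes: 1 = "Not Paid", 2 = "Paid"
table(paid_f)       # counts per category
```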
###round the data to 2 decimal places (just to make it look a bit nicer)
data$log.annual.inc <- round(data$log.annual.inc, 2)
data$int.rate <- round(data$int.rate, 2)
data$installment <- round(data$installment, 2)
data$dti <- round(data$dti, 2)
data$days.with.cr.line <- round(data$days.with.cr.line, 2)
head(data)
## int.rate installment log.annual.inc dti days.with.cr.line revol.util
## 1 0.12 829.10 11.35 19.48 5639.96 52.1
## 2 0.11 228.22 11.08 14.29 2760.00 76.7
## 3 0.14 366.86 10.37 11.63 4710.00 25.6
## 4 0.10 162.34 11.35 8.10 2699.96 73.2
## 5 0.14 102.92 11.30 14.97 4066.00 39.5
## 6 0.08 125.13 11.90 16.98 6120.04 51.0
## inq.last.6mths delinq.2yrs pub.rec credit.policy.F purposeF
## 1 0 0 0 Fulfill debt_consolidation
## 2 0 0 0 Fulfill credit_card
## 3 1 0 0 Fulfill debt_consolidation
## 4 1 0 0 Fulfill debt_consolidation
## 5 0 1 0 Fulfill credit_card
## 6 0 0 0 Fulfill credit_card
## not.fully.paidF revol.balN
## 1 Paid 28854
## 2 Paid 33623
## 3 Paid 3511
## 4 Paid 33667
## 5 Paid 4740
## 6 Paid 50807
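Note that round() changes the stored values, so every later statistic for int.rate is computed from the rounded numbers (0.1189 becomes 0.12). As a hedged sketch, the first three int.rate values from head(data) show the difference between rounding the data and rounding only the display:

```r
# First three int.rate values as shown by head(data) before rounding
rates <- c(0.1189, 0.1071, 0.1357)

# Rounding the stored values changes what later statistics are based on
rounded <- round(rates, 2)
mean(rates)    # mean of the original values
mean(rounded)  # mean of the rounded values, slightly different

# Rounding only for display leaves the stored values untouched
print(rates, digits = 2)
```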
###find missing values. If there are any, we can remove them, for example with the function drop_na() from the tidyr package
find_missing_values <- function(data) {
missing <- sum(is.na(data))
if(missing > 0) {
cat("Number of missing values in the dataset:", missing, "\n")
cat("Indices of missing values:\n")
print(which(is.na(data), arr.ind = TRUE))
} else {
cat("No missing values found in the dataset.\n")
}
}
find_missing_values(data)
## No missing values found in the dataset.
###however, as we can see, there are no missing values here
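For a data set that does contain missing values, base R can count them per column and drop incomplete rows without extra packages. A minimal sketch on a made-up toy data frame (the names `toy`, `na_per_column` and `complete_rows` are only illustrative):

```r
# Toy data frame with two missing values (made-up numbers)
toy <- data.frame(dti        = c(19.48, NA, 11.63),
                  revol.util = c(52.1, 76.7, NA))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the rows without any missing value
complete_rows <- toy[complete.cases(toy), ]
print(complete_rows)
```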
summary(data) ###descriptive statistic for all variables
## int.rate installment log.annual.inc dti
## Min. :0.0600 Min. : 15.67 Min. : 7.55 Min. : 0.000
## 1st Qu.:0.1000 1st Qu.:163.77 1st Qu.:10.56 1st Qu.: 7.213
## Median :0.1200 Median :268.95 Median :10.93 Median :12.665
## Mean :0.1228 Mean :319.09 Mean :10.93 Mean :12.607
## 3rd Qu.:0.1400 3rd Qu.:432.76 3rd Qu.:11.29 3rd Qu.:17.950
## Max. :0.2200 Max. :940.14 Max. :14.53 Max. :29.960
##
## days.with.cr.line revol.util inq.last.6mths delinq.2yrs
## Min. : 179 Min. : 0.0 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 2820 1st Qu.: 22.6 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 4140 Median : 46.3 Median : 1.000 Median : 0.0000
## Mean : 4561 Mean : 46.8 Mean : 1.577 Mean : 0.1637
## 3rd Qu.: 5730 3rd Qu.: 70.9 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :17640 Max. :119.0 Max. :33.000 Max. :13.0000
##
## pub.rec credit.policy.F purposeF not.fully.paidF
## Min. :0.00000 Fulfill :7710 credit_card :1262 Not Paid:1533
## 1st Qu.:0.00000 Oterwise:1868 debt_consolidation:3957 Paid :8045
## Median :0.00000 educational : 343
## Mean :0.06212 major_purchase : 437
## 3rd Qu.:0.00000 small_business : 619
## Max. :5.00000 all_other :2331
## home_improvement : 629
## revol.balN
## Min. : 0
## 1st Qu.: 3187
## Median : 8596
## Mean : 16914
## 3rd Qu.: 18250
## Max. :1207359
##
summary(data [ , c(-10, -11, -12) ]) ###descriptive statistics without categorical variable
## int.rate installment log.annual.inc dti
## Min. :0.0600 Min. : 15.67 Min. : 7.55 Min. : 0.000
## 1st Qu.:0.1000 1st Qu.:163.77 1st Qu.:10.56 1st Qu.: 7.213
## Median :0.1200 Median :268.95 Median :10.93 Median :12.665
## Mean :0.1228 Mean :319.09 Mean :10.93 Mean :12.607
## 3rd Qu.:0.1400 3rd Qu.:432.76 3rd Qu.:11.29 3rd Qu.:17.950
## Max. :0.2200 Max. :940.14 Max. :14.53 Max. :29.960
## days.with.cr.line revol.util inq.last.6mths delinq.2yrs
## Min. : 179 Min. : 0.0 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 2820 1st Qu.: 22.6 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 4140 Median : 46.3 Median : 1.000 Median : 0.0000
## Mean : 4561 Mean : 46.8 Mean : 1.577 Mean : 0.1637
## 3rd Qu.: 5730 3rd Qu.: 70.9 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :17640 Max. :119.0 Max. :33.000 Max. :13.0000
## pub.rec revol.balN
## Min. :0.00000 Min. : 0
## 1st Qu.:0.00000 1st Qu.: 3187
## Median :0.00000 Median : 8596
## Mean :0.06212 Mean : 16914
## 3rd Qu.:0.00000 3rd Qu.: 18250
## Max. :5.00000 Max. :1207359
Interpreting every variable would be repetitive, so I will discuss only two variables of my choice.
Interest rate (int.rate, after rounding to two decimals): the lowest interest rate in this sample is 6%, the highest is 22%. 1st Qu. means that a quarter of borrowers have an interest rate of 10% or lower, the median means that half of borrowers have 12% or lower, and 3rd Qu. means that 75% of borrowers have 14% or lower; accordingly, the last quarter of borrowers have interest rates above 14%. The mean is the arithmetic average of the interest rate, 12.28%.
Number of days the borrower has had a credit line (days.with.cr.line): the minimum is 179 days, the maximum is 17,640 days. The mean is obtained by adding all the values and dividing by the number of observations; on average, borrowers have had a credit line for 4561 days. 1st Qu. means that a quarter of borrowers have had a credit line for 2820 days or fewer. The median means that half of the borrowers have had a credit line for 4,140 days or fewer. 3rd Qu. means that 75% of borrowers have had a credit line for 5730 days or fewer.
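The reading of a quartile as "a quarter of the observations lie at or below this value" can be checked directly. A small sketch with made-up values (not the loan data):

```r
# Made-up credit line lengths in days
days <- c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000)

# First quartile, as reported in the "1st Qu." row of summary()
q1 <- quantile(days, probs = 0.25)
print(q1)

# Share of observations that lie at or below the first quartile
share_below <- mean(days <= q1)
print(share_below)
```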
###another way to obtain the values from the descriptive statistics is to use the individual functions below. I will calculate them only for the borrower's revolving balance variable.
min(data$revol.balN)
## [1] 0
max(data$revol.balN)
## [1] 1207359
mean(data$revol.balN)
## [1] 16913.96
median(data$revol.balN)
## [1] 8596
sd(data$revol.balN)
## [1] 33756.19
var(data$revol.balN)
## [1] 1139480333
range(data$revol.balN)
## [1] 0 1207359
sum(data$revol.balN)
## [1] 162001946
quantile(data$revol.balN)
## 0% 25% 50% 75% 100%
## 0.0 3187.0 8596.0 18249.5 1207359.0
quantile(data$revol.balN, probs = 0.10)
## 10%
## 710.7
- min: the smallest revolving balance is $0 (i.e. everything is paid).
- max: the largest revolving balance is $1,207,359.
- mean: the average revolving balance of a borrower in the sample is $16,913.96.
- median: half of the borrowers have a revolving balance (amount unpaid at the end of the credit card billing cycle) of $8,596 or less.
- sd: the standard deviation is $33,756.19, which describes how far the values typically lie from the mean.
- range: range() returns the smallest and the largest value; the difference between them, that is, the spread of the unpaid amounts across borrowers, is $1,207,359.
- sum: the total of all unpaid amounts from all borrowers is $162,001,946.
- quantile: a quarter of borrowers have a revolving balance of $3,187 or less, half of the borrowers have $8,596 or less, and 75% of borrowers have $18,249.5 or less.
- quantile, 10%: 10% of borrowers have a revolving balance of $710.7 or less.
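Two of these functions are closely related to others: sd() is the square root of var(), and range() returns two numbers whose difference is the spread. A short sketch with made-up balances:

```r
# Made-up revolving balances in dollars
balance <- c(0, 3187, 8596, 18250)

# The standard deviation is the square root of the variance
sd(balance)
sqrt(var(balance))

# range() returns c(min, max); diff() turns it into a single spread value
range(balance)
spread <- diff(range(balance))
print(spread)
```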
#install.packages("pastecs")
library(pastecs)
round(stat.desc(data[ , c(-10, -11, -12) ]), 2)
## int.rate installment log.annual.inc dti days.with.cr.line
## nbr.val 9578.00 9578.00 9578.00 9578.00 9578.00
## nbr.null 0.00 0.00 0.00 89.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 0.06 15.67 7.55 0.00 178.96
## max 0.22 940.14 14.53 29.96 17639.96
## range 0.16 924.47 6.98 29.96 17461.00
## sum 1175.73 3056238.40 104710.00 120746.77 43683027.92
## median 0.12 268.95 10.93 12.66 4139.96
## mean 0.12 319.09 10.93 12.61 4560.77
## SE.mean 0.00 2.12 0.01 0.07 25.51
## CI.mean.0.95 0.00 4.15 0.01 0.14 50.01
## var 0.00 42878.52 0.38 47.39 6234661.48
## std.dev 0.03 207.07 0.61 6.88 2496.93
## coef.var 0.22 0.65 0.06 0.55 0.55
## revol.util inq.last.6mths delinq.2yrs pub.rec revol.balN
## nbr.val 9578.00 9578.00 9578.00 9578.00 9.578000e+03
## nbr.null 297.00 3637.00 8458.00 9019.00 3.210000e+02
## nbr.na 0.00 0.00 0.00 0.00 0.000000e+00
## min 0.00 0.00 0.00 0.00 0.000000e+00
## max 119.00 33.00 13.00 5.00 1.207359e+06
## range 119.00 33.00 13.00 5.00 1.207359e+06
## sum 448243.08 15109.00 1568.00 595.00 1.620019e+08
## median 46.30 1.00 0.00 0.00 8.596000e+03
## mean 46.80 1.58 0.16 0.06 1.691396e+04
## SE.mean 0.30 0.02 0.01 0.00 3.449200e+02
## CI.mean.0.95 0.58 0.04 0.01 0.01 6.761100e+02
## var 841.84 4.84 0.30 0.07 1.139480e+09
## std.dev 29.01 2.20 0.55 0.26 3.375619e+04
## coef.var 0.62 1.39 3.34 4.22 2.000000e+00
###The stat.desc function also calculates basic statistics. round is needed to round the value to two decimal places so that the results look better. I also removed categorical variables, because their statistical description does not make any sense
Here, too, I will explain only one variable, installment.
- nbr.val 9578.00 is the number of observations that have a value, that is, which are not missing. In this case it is all 9578 observations.
- nbr.null 0.00 and nbr.na 0.00 mean that installment contains neither zeros nor missing values.
- min 15.67: the smallest monthly payment is $15.67.
- max 940.14: the largest monthly payment of a borrower is $940.14.
- range 924.47: the difference between the largest and smallest monthly payments in the sample is $924.47.
- sum 3056238.40: all borrowers in the sample taken together must pay $3,056,238.40 per month.
- median 268.95: half of borrowers pay $268.95 or less each month.
- mean 319.09: on average a borrower pays $319.09 per month.
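The SE.mean, CI.mean.0.95 and coef.var rows are not explained above. As a hedged sketch, assuming stat.desc uses the usual textbook formulas (standard error = sd / sqrt(n), confidence half-width = t quantile times standard error, coefficient of variation = sd / mean), the values reported for installment can be reproduced from n, std.dev and mean:

```r
# Values for installment taken from the stat.desc output above
n <- 9578      # nbr.val
s <- 207.07    # std.dev
m <- 319.09    # mean

# Standard error of the mean
se <- s / sqrt(n)

# Half-width of the 95% confidence interval for the mean (t distribution)
ci_half <- qt(0.975, df = n - 1) * se

# Coefficient of variation: standard deviation relative to the mean
cv <- s / m

round(c(SE.mean = se, CI.mean.0.95 = ci_half, coef.var = cv), 2)
```

The rounded results agree with the SE.mean (2.12), CI.mean.0.95 (4.15) and coef.var (0.65) rows, so the mean monthly payment is estimated at roughly $319.09 plus or minus $4.15.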
library(psych)
describe(data [ , c(-10, -11, -12) ]) ###another function to calculate statistics, without categorical variables as well
## vars n mean sd median trimmed mad min
## int.rate 1 9578 0.12 0.03 0.12 0.12 0.03 0.06
## installment 2 9578 319.09 207.07 268.95 295.64 184.88 15.67
## log.annual.inc 3 9578 10.93 0.61 10.93 10.93 0.55 7.55
## dti 4 9578 12.61 6.88 12.66 12.59 7.98 0.00
## days.with.cr.line 5 9578 4560.77 2496.93 4139.96 4303.64 2135.06 178.96
## revol.util 6 9578 46.80 29.01 46.30 46.50 35.88 0.00
## inq.last.6mths 7 9578 1.58 2.20 1.00 1.16 1.48 0.00
## delinq.2yrs 8 9578 0.16 0.55 0.00 0.02 0.00 0.00
## pub.rec 9 9578 0.06 0.26 0.00 0.00 0.00 0.00
## revol.balN 10 9578 16913.96 33756.19 8596.00 10809.19 9619.11 0.00
## max range skew kurtosis se
## int.rate 0.22 0.16 0.17 -0.21 0.00
## installment 940.14 924.47 0.91 0.14 2.12
## log.annual.inc 14.53 6.98 0.03 1.61 0.01
## dti 29.96 29.96 0.02 -0.90 0.07
## days.with.cr.line 17639.96 17461.00 1.16 1.94 25.51
## revol.util 119.00 119.00 0.06 -1.12 0.30
## inq.last.6mths 33.00 33.00 3.58 26.27 0.02
## delinq.2yrs 13.00 13.00 6.06 71.38 0.01
## pub.rec 5.00 5.00 5.12 38.75 0.00
## revol.balN 1207359.00 1207359.00 11.16 259.46 344.92
Here I propose to analyze the pub.rec variable, i.e. the borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments). This is variable (vars) number 9 in the output table.
- n is the number of observations, which is 9578 for every variable.
- mean is, as always, the arithmetic mean: on average, a borrower has 0.06 public records.
- sd stands for standard deviation: the values typically lie about 0.26 records away from the mean.
- median of 0.00 means that at least half of the borrowers have 0 public records.
- trimmed is the trimmed mean, the mean computed after a certain proportion of the data points has been removed from both ends of the distribution (10% from each end by default in describe). This is done to reduce the influence of extreme values (outliers) on the result. Here the trimmed mean is 0.00 while the ordinary mean is 0.06, so the mean is driven by a small number of borrowers who have one or more records.
- min and max show the smallest and largest number of public records a borrower has, in this case 0 and 5, respectively.
- range shows the spread of the data: max - min = 5 - 0 = 5 records.
- skew (skewness) measures the asymmetry of the distribution around the mean. Here the skew is positive (5.12), which means there is a longer “tail” on the right-hand side.
- kurtosis measures the tailedness or peakedness of a distribution. Here it is strongly positive (38.75), which means heavier tails and a sharper peak compared to a normal distribution.
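The trimmed mean and the skewness can be illustrated on a small made-up vector that, like pub.rec, consists mostly of zeros. This is only a sketch: the skewness below is the plain moment-based formula, which may differ slightly from the value psych reports because of its sample-size correction.

```r
# Made-up record counts: mostly zeros and a few larger values
records <- c(rep(0, 17), 1, 1, 5)

# Ordinary mean, pulled up by the few large values
plain_mean <- mean(records)

# Trimmed mean: the lowest 10% and highest 10% of values are dropped first
trimmed_mean <- mean(records, trim = 0.1)

# Moment-based skewness: positive when the right tail is longer
dev  <- records - mean(records)
skew <- mean(dev^3) / mean(dev^2)^1.5

print(c(mean = plain_mean, trimmed = trimmed_mean, skew = skew))
```

The trimmed mean is much smaller than the ordinary mean here because the single value of 5 is removed before averaging.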
describeBy(data$dti, group = data$purposeF) ###the original purpose column was dropped above, so the factor purposeF is used for grouping
##
## Descriptive statistics by group
## group: credit_card
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1262 14.1 6.47 14.38 14.21 7.46 0 29.95 29.95 -0.11 -0.8 0.18
## ------------------------------------------------------------
## group: debt_consolidation
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 3957 14.08 6.43 14.24 14.18 7.35 0 29.96 29.96 -0.09 -0.77 0.1
## ------------------------------------------------------------
## group: educational
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 343 11.34 6.94 11.42 11.15 8.29 0 29.74 29.74 0.19 -0.87 0.37
## ------------------------------------------------------------
## group: major_purchase
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 437 10.16 6.63 9.51 9.91 7.78 0 26.15 26.15 0.26 -0.96 0.32
## ------------------------------------------------------------
## group: small_business
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 619 10.79 6.93 10.39 10.53 8.09 0 29.21 29.21 0.26 -0.92 0.28
## ------------------------------------------------------------
## group: all_other
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2331 11.08 7.1 10.56 10.84 8.51 0 29.9 29.9 0.23 -0.92 0.15
## ------------------------------------------------------------
## group: home_improvement
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 629 10.2 6.78 9.66 9.82 8.04 0 28.17 28.17 0.38 -0.75 0.27
This function calculates the statistics for the variable “dti” separately for each category of the categorical variable “purpose”. The output is easier to compare in the knitted document, although the format is the same either way.
The most popular purpose is debt_consolidation, with 3957 observations, followed by all_other and credit_card with 2331 and 1262 observations, respectively. The least popular purpose is educational, with only 343 observations.
The average debt-to-income ratio also varies by category. For credit_card it is 14.1, which is the highest mean of all categories, closely followed by debt_consolidation with 14.08. For those who borrow for educational purposes the mean is 11.34. The lowest means are 10.16 for major_purchase and 10.20 for home_improvement. The data description does not state the unit of dti; since the values range from 0 to about 30, the ratio is most plausibly expressed in percent (debt of about 14% of income) rather than as a multiple of income.
The two highest medians are 14.38 and 14.24 for credit_card and debt_consolidation, respectively, which means that half of those who borrow for credit cards have a debt-to-income ratio of 14.38 or less, and half of those who borrow for debt consolidation have a ratio of 14.24 or less. By “less” I mean less debt relative to income.
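If only the group sizes and group means are needed, base R can produce them without psych. A minimal sketch on a made-up toy data frame (column names mirror the loan data, the values do not):

```r
# Toy data: debt-to-income ratio and loan purpose (made-up values)
toy <- data.frame(
  dti     = c(14, 16, 10, 12, 8),
  purpose = factor(c("credit_card", "credit_card",
                     "educational", "educational", "educational"))
)

# Number of observations per purpose
group_n <- table(toy$purpose)
print(group_n)

# Mean dti per purpose
group_mean <- tapply(toy$dti, toy$purpose, mean)
print(group_mean)
```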
describeBy(data$revol.balN, group = data$credit.policy.F)
##
## Descriptive statistics by group
## group: Fulfill
## vars n mean sd median trimmed mad min max range skew
## X1 1 7710 13798.4 16878.56 8707.5 10514.1 9423.41 0 149527 149527 2.97
## kurtosis se
## X1 12.54 192.22
## ------------------------------------------------------------
## group: Oterwise
## vars n mean sd median trimmed mad min max range
## X1 1 1868 29773.15 66807.57 8039.5 14032.46 10343.36 0 1207359 1207359
## skew kurtosis se
## X1 6.64 79.16 1545.74
Here is another example of calculating statistics for each category separately. This time the statistics are calculated for the quantitative variable “revol.balN”, the borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle), grouped by whether the customer meets the credit underwriting criteria of the platform or not. Let’s look at the values that I did not analyze in the previous example.
The minimum and maximum revolving balance for those who meet the criteria are $0 and $149,527, respectively, while for those who do not meet the criteria they are $0 and $1,207,359.
Both skewness values are positive: 2.97 for those who meet the criteria and 6.64 for those who do not. A positive skew means that the right “tail” is longer.
Kurtosis is positive in both cases, so both distributions have heavier tails and a sharper peak than a normal distribution. For those who do not meet the criteria the value is larger (79.16 vs 12.54), that is, extreme balances are even more pronounced in that group.
hist(data$dti, ###choose the variable
main = "Distribution of Debt-to-Income Ratio", ###add title
ylab = "Frequency", ### add name for the y axis
xlab = "Debt-to-Income Ratio", ###add name for the x axis
breaks = seq(from = 0, to = 30, by = 5)) ###so the data will range from 0 to 30 and the x-axis interval will be 5 units
The histogram visualizes the debt-to-income ratio and shows how often values fall into each interval. The Y-axis shows the frequency. Ratios between 10 and 15 occur most often, and ratios between 25 and 30 are the least frequent. The distribution is close to symmetric (the skew reported by describe above is 0.02), with only a slightly longer right tail.
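The bar heights are simply counts of observations per interval. As a sketch with made-up ratios, the same counts can be obtained with cut() and table(), or from hist() with plotting switched off:

```r
# Made-up debt-to-income ratios between 0 and 30
dti_toy <- c(0, 4.0, 8.1, 11.63, 14.29, 14.97, 16.98, 19.48, 22.09, 29.96)

# Assign each value to a 5-unit interval, as in breaks = seq(0, 30, by = 5)
bins   <- cut(dti_toy, breaks = seq(0, 30, by = 5), include.lowest = TRUE)
counts <- table(bins)
print(counts)

# hist() computes the same counts; plot = FALSE returns them without drawing
hist_counts <- hist(dti_toy, breaks = seq(0, 30, by = 5), plot = FALSE)$counts
print(hist_counts)
```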
#install.packages("ggplot2")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(data, aes(x = dti)) + ###add dataframe and variable for x-axis
geom_histogram(binwidth = 5, boundary = 0, ###bins are 5 units wide and start at 0, matching the breaks used in hist() above
colour = "darkblue", ###Change the color of the column borders
fill = "skyblue") + ###Change the color of the columns
labs(title = "Debt-to-Income Frequency", ###add title
x = "Debt-to-Income Ratio", ###name x-axis
y = "Frequency") + ###name y-axis
theme_minimal() ###change theme
In fact, this is the same histogram for the same variable. The only differences are the tool used to create it (that is, the R command and the package to which it belongs) and a slightly nicer appearance.
boxplot(data$revol.util)
Here we can see the boxplot for the revol.util variable. There is no scale on the x-axis, but the y-axis can be read as follows. The ends of the whiskers show the minimum (0) and maximum (119) values, so the range of the data can be read from them. The thick line represents the median, which in this case is 46.3 (the mean, 46.80, is not drawn in a boxplot). As a reminder, this variable is the amount of the credit line used relative to total credit available, so it denotes a share. The edges of the gray box show the first and third quartiles, which are 22.6 and 70.9, respectively.
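Whiskers only reach the minimum and maximum when no value lies more than 1.5 times the interquartile range beyond the box. A worked check with the quartiles from summary(data) above (a sketch: boxplot() uses hinges, which can differ slightly from these quartiles):

```r
# Values for revol.util taken from the summary() output above
q1    <- 22.6
q3    <- 70.9
x_min <- 0
x_max <- 119

# Interquartile range and the 1.5 * IQR fences used for the whiskers
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
print(c(iqr = iqr, lower = lower_fence, upper = upper_fence))

# TRUE means no point is drawn separately as an outlier
whiskers_reach_extremes <- x_min >= lower_fence && x_max <= upper_fence
print(whiskers_reach_extremes)
```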
library(ggplot2)
ggplot(data, aes(x = revol.util)) +
geom_boxplot(colour = "darkred", ###change the color of the box borders
fill = "pink") + ###change the fill color of the box
labs(title = "Revolving Line Utilization Rate", ###add title
x = "Revolving line utilization rate", ###name x-axis, which carries the values
y = NULL) + ###no y-axis name, because the y-axis has no meaning for a single horizontal boxplot
theme_classic() ###change theme
This is the same boxplot for revol.util. The differences are the tool used to create it (that is, the R command and the package to which it belongs), a nicer appearance, and the fact that the values are located along the x-axis, so the graph is horizontal rather than vertical.
ggplot(data, aes(x=dti, ###add variable
fill=purposeF))+ ###group by categories
geom_boxplot() + ###type of the graph
xlab("Debt-to-Income Ratio") + ###x-axis name
labs(fill="Purpose of Credit", ###add legend title
title = "Debt-to-Income based on Purpose") ###add title
This graph lets us visually compare how the Debt-to-Income Ratio differs depending on the purpose of the loan. For example, we see that the median ratio is noticeably higher for people whose purpose is credit card or debt consolidation. The range for those who borrow for a major purchase is the smallest; its maximum ratio is about 26. Although the median is approximately the same for small business and all other purposes, the ratio for 75% of small-business borrowers is smaller, which can be seen from the upper edge of the box, that is, the third quartile.
library(ggplot2)
ggplot(data, aes(x = int.rate, y = log.annual.inc)) + ###add variables for the x- and y-axes
geom_point(shape = 21, ###change shape of the points
size = 3, ###change size of the points
color = "white", ###change the color of the point borders
fill = "wheat1") + ###change the fill color of the points
labs(title = "Income-to-Interest Rate Relation", ###add title
x = "Interest Rate", ###and name to the x-axis
y = "Annual Income in log-points")+ ###add name to the y-axis
theme_dark() ###change theme
This scatterplot lets us investigate the relationship between the interest rate and the borrower’s income (on the log scale).
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(int.rate ~ revol.balN | not.fully.paidF, ###add variables: int.rate on the y-axis, revol.balN on the x-axis, grouped by not.fully.paidF
smooth = FALSE,
xlab = "Balance", ###add x-axis name
ylab = "Interest rate", ###add y-axis name
main = "Scatterplot", ###add title
data = data, ###choose dataframe
cex = 0.5, ###change points size
bg = "lightyellow")
In addition to showing the relationship between two numerical variables,
this graph also allows us to compare how this relationship differs based
on the categorical variable “not.fully.paid”. The trend line is steeper
for those who have paid in full, so in that group the interest rate
changes more per dollar of balance. A steeper line does not by itself
mean a stronger correlation, which depends on how tightly the points
cluster around the line. For both categories the relationship is
positive, because the trend lines “go up”.
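The difference between the slope of a trend line and the strength of a correlation can be seen on two made-up series (a sketch, not the loan data): one has a steep but noisy trend, the other a flat but perfectly linear one.

```r
# Shared x values and two made-up responses
x             <- c(1, 2, 3, 4, 5)
y_steep_noisy <- c(2, 9, 4, 12, 8)          # steep trend, scattered points
y_flat_tight  <- c(1.0, 1.1, 1.2, 1.3, 1.4) # flat trend, points on a line

# Slopes of the fitted trend lines
slope_noisy <- unname(coef(lm(y_steep_noisy ~ x))[2])
slope_tight <- unname(coef(lm(y_flat_tight ~ x))[2])

# Correlation coefficients
cor_noisy <- cor(x, y_steep_noisy)
cor_tight <- cor(x, y_flat_tight)

print(c(slope_noisy = slope_noisy, slope_tight = slope_tight))
print(c(cor_noisy = cor_noisy, cor_tight = cor_tight))
```

For the loan data, the correlation within each group could be computed in the same way with cor() on the rows of each not.fully.paidF category.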