hw2

Author: Polina Kamenchuk. Student ID: 12300685

##to upload data set into R.

data <- read.table("./loan_data.csv", ###name of the file 
                     header = TRUE,  ###first row as a header 
                     sep = ",",  ###columns are separated from each other with ","
                     dec = ".")    ###"." is used in decimal numbers

str(data)   ###to see the data structure

## 'data.frame':    9578 obs. of  14 variables:
##  $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ purpose          : chr  "debt_consolidation" "credit_card" "debt_consolidation" "debt_consolidation" ...
##  $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
##  $ installment      : num  829 228 367 162 103 ...
##  $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
##  $ dti              : num  19.5 14.3 11.6 8.1 15 ...
##  $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
##  $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
##  $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
##  $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
##  $ inq.last.6mths   : int  0 0 1 1 0 0 0 0 1 1 ...
##  $ delinq.2yrs      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pub.rec          : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ not.fully.paid   : int  0 0 0 0 0 0 1 1 0 0 ...

head(data)   ###to display the first 6 rows

##   credit.policy            purpose int.rate installment log.annual.inc   dti
## 1             1 debt_consolidation   0.1189      829.10       11.35041 19.48
## 2             1        credit_card   0.1071      228.22       11.08214 14.29
## 3             1 debt_consolidation   0.1357      366.86       10.37349 11.63
## 4             1 debt_consolidation   0.1008      162.34       11.35041  8.10
## 5             1        credit_card   0.1426      102.92       11.29973 14.97
## 6             1        credit_card   0.0788      125.13       11.90497 16.98
##   fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs
## 1  737          5639.958     28854       52.1              0           0
## 2  707          2760.000     33623       76.7              0           0
## 3  682          4710.000      3511       25.6              1           0
## 4  712          2699.958     33667       73.2              1           0
## 5  667          4066.000      4740       39.5              0           1
## 6  727          6120.042     50807       51.0              0           0
##   pub.rec not.fully.paid
## 1       0              0
## 2       0              0
## 3       0              0
## 4       0              0
## 5       0              0
## 6       0              0

DATA DESCRIPTION

The data set was taken from the Kaggle website (https://www.kaggle.com/datasets/itssuru/loan-data).

The sample contains users (borrowers) of the platform LendingClub.com (a platform where private borrowers and lenders meet). The sample size is 9578 observations (this can be seen from data in the Environment pane).

The unit of observation is one such borrower.

description of each variable:

credit.policy: 1 if the customer meets the credit underwriting criteria, and 0 if not, categorical.

purpose: The purpose of the loan, categorical, nominal.

int.rate: The interest rate of the loan as a proportion (10% = 0.10), numerical.

installment: The monthly installments owed by the borrower if the loan is funded in USD, numerical.

log.annual.inc: The natural log of the self-reported annual income of the borrower, numerical.

dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income), numerical.

fico: The FICO credit score of the borrower, in scores, numerical.

days.with.cr.line: The number of days the borrower has had a credit line, number of days, numerical.

revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle, in USD), numerical.

revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available), numerical.

inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months, amount, numerical.

delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years, numerical.

pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).

not.fully.paid: 1 if the loan was not fully paid, and 0 if it was paid, categorical.

###I don't really understand the FICO score, and I assume there are enough other variables to analyse, so let me remove it from the data set :)

data <- data[, -7 ]

head(data)

##   credit.policy            purpose int.rate installment log.annual.inc   dti
## 1             1 debt_consolidation   0.1189      829.10       11.35041 19.48
## 2             1        credit_card   0.1071      228.22       11.08214 14.29
## 3             1 debt_consolidation   0.1357      366.86       10.37349 11.63
## 4             1 debt_consolidation   0.1008      162.34       11.35041  8.10
## 5             1        credit_card   0.1426      102.92       11.29973 14.97
## 6             1        credit_card   0.0788      125.13       11.90497 16.98
##   days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec
## 1          5639.958     28854       52.1              0           0       0
## 2          2760.000     33623       76.7              0           0       0
## 3          4710.000      3511       25.6              1           0       0
## 4          2699.958     33667       73.2              1           0       0
## 5          4066.000      4740       39.5              0           1       0
## 6          6120.042     50807       51.0              0           0       0
##   not.fully.paid
## 1              0
## 2              0
## 3              0
## 4              0
## 5              0
## 6              0
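
Removing a column by position (`-7`) works, but it silently removes the wrong variable if the column order ever changes. A minimal sketch of dropping a column by name instead, on a hypothetical toy data frame (`toy` and its values are made up for illustration):

```r
# Hypothetical toy data frame standing in for the loan data
toy <- data.frame(
  credit.policy = c(1, 0, 1),
  fico          = c(737, 707, 682),
  dti           = c(19.48, 14.29, 11.63)
)

# Keep every column except "fico"; the result does not depend on column order
toy_no_fico <- toy[, setdiff(names(toy), "fico"), drop = FALSE]

names(toy_no_fico)
```

The same pattern also accepts several names at once, e.g. `setdiff(names(toy), c("fico", "dti"))`.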

calculate_number_of_variables <- function(data) {
  num_variables <- ncol(data)
  cat("Number of variables in the dataset:", num_variables, "\n")
}

calculate_number_of_variables(data)

## Number of variables in the dataset: 13

There are 13 variables, which can also be seen in Environment > data, but the number can also be calculated with the function above.

data$credit.policy.F <- factor(data$credit.policy, 
                               levels = c(1, 0), 
                               labels = c("Fulfill", "Otherwise"))  ###to convert categorical into factor


data$purposeF <- factor(data$purpose, 
                               levels = c("credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", "all_other", "home_improvement"), 
                               labels = c("credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", "all_other", "home_improvement"))    ###to convert categorical into factor


data$not.fully.paidF <- factor(data$not.fully.paid, 
                               levels = c(1, 0), 
                               labels = c("Not Paid", "Paid"))     ###to convert categorical into factor

data$revol.balN <- as.numeric(data$revol.bal)     ###to convert integer to numeric (double), because the unpaid amount is in dollars, so it should be numerical 


data <- data[ ,  c(-1, -2, -8, -13)   ]  ### I also delete the original variables (credit.policy, purpose, revol.bal, not.fully.paid), because the converted versions replace them

head(data, 15)   ### display first 15 rows

##    int.rate installment log.annual.inc   dti days.with.cr.line revol.util
## 1    0.1189      829.10      11.350407 19.48          5639.958       52.1
## 2    0.1071      228.22      11.082143 14.29          2760.000       76.7
## 3    0.1357      366.86      10.373491 11.63          4710.000       25.6
## 4    0.1008      162.34      11.350407  8.10          2699.958       73.2
## 5    0.1426      102.92      11.299732 14.97          4066.000       39.5
## 6    0.0788      125.13      11.904968 16.98          6120.042       51.0
## 7    0.1496      194.02      10.714418  4.00          3180.042       76.8
## 8    0.1114      131.22      11.002100 11.08          5116.000       68.6
## 9    0.1134       87.19      11.407565 17.25          3989.000       51.1
## 10   0.1221       84.12      10.203592 10.00          2730.042       23.0
## 11   0.1347      360.43      10.434116 22.09          6713.042       71.0
## 12   0.1324      253.58      11.835009  9.16          4298.000       18.2
## 13   0.0859      316.11      10.933107 15.49          6519.958       16.7
## 14   0.0714       92.82      11.512925  6.50          4384.000        4.8
## 15   0.0863      209.54       9.487972  9.73          1559.958       44.6
##    inq.last.6mths delinq.2yrs pub.rec credit.policy.F           purposeF
## 1               0           0       0         Fulfill debt_consolidation
## 2               0           0       0         Fulfill        credit_card
## 3               1           0       0         Fulfill debt_consolidation
## 4               1           0       0         Fulfill debt_consolidation
## 5               0           1       0         Fulfill        credit_card
## 6               0           0       0         Fulfill        credit_card
## 7               0           0       1         Fulfill debt_consolidation
## 8               0           0       0         Fulfill          all_other
## 9               1           0       0         Fulfill   home_improvement
## 10              1           0       0         Fulfill debt_consolidation
## 11              2           0       1         Fulfill debt_consolidation
## 12              2           1       0         Fulfill debt_consolidation
## 13              0           0       0         Fulfill debt_consolidation
## 14              0           1       0         Fulfill     small_business
## 15              0           0       0         Fulfill debt_consolidation
##    not.fully.paidF revol.balN
## 1             Paid      28854
## 2             Paid      33623
## 3             Paid       3511
## 4             Paid      33667
## 5             Paid       4740
## 6             Paid      50807
## 7         Not Paid       3839
## 8         Not Paid      24220
## 9             Paid      69909
## 10            Paid       5630
## 11            Paid      13846
## 12            Paid       5122
## 13            Paid       6068
## 14            Paid       3021
## 15            Paid       6282

str(data)   ###structure

## 'data.frame':    9578 obs. of  13 variables:
##  $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
##  $ installment      : num  829 228 367 162 103 ...
##  $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
##  $ dti              : num  19.5 14.3 11.6 8.1 15 ...
##  $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
##  $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
##  $ inq.last.6mths   : int  0 0 1 1 0 0 0 0 1 1 ...
##  $ delinq.2yrs      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pub.rec          : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ credit.policy.F  : Factor w/ 2 levels "Fulfill","Otherwise": 1 1 1 1 1 1 1 1 1 1 ...
##  $ purposeF         : Factor w/ 7 levels "credit_card",..: 2 1 2 2 1 1 2 6 7 2 ...
##  $ not.fully.paidF  : Factor w/ 2 levels "Not Paid","Paid": 2 2 2 2 2 2 1 1 2 2 ...
##  $ revol.balN       : num  28854 33623 3511 33667 4740 ...
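
When `levels` and `labels` are given in a non-default order (here `1` comes before `0`), it is easy to attach a label to the wrong code. A minimal sketch of how the mapping can be checked with a cross-table, using hypothetical 0/1 codes rather than the real column:

```r
# Hypothetical 0/1 codes, coded like not.fully.paid
codes <- c(0, 0, 1, 0, 1)

# Same levels/labels order as in the conversion above
status <- factor(codes,
                 levels = c(1, 0),
                 labels = c("Not Paid", "Paid"))

# Cross-tabulate the raw codes against the labels to confirm the mapping:
# all 1s should fall under "Not Paid" and all 0s under "Paid"
check <- table(code = codes, label = status)
print(check)
```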

###to round data to 2 decimal places (just to make it look a bit nicer)

data$log.annual.inc <- round(data$log.annual.inc, 2)
data$int.rate <- round(data$int.rate, 2)
data$installment <- round(data$installment, 2)
data$dti <- round(data$dti, 2)
data$days.with.cr.line <- round(data$days.with.cr.line, 2)

head(data)

##   int.rate installment log.annual.inc   dti days.with.cr.line revol.util
## 1     0.12      829.10          11.35 19.48           5639.96       52.1
## 2     0.11      228.22          11.08 14.29           2760.00       76.7
## 3     0.14      366.86          10.37 11.63           4710.00       25.6
## 4     0.10      162.34          11.35  8.10           2699.96       73.2
## 5     0.14      102.92          11.30 14.97           4066.00       39.5
## 6     0.08      125.13          11.90 16.98           6120.04       51.0
##   inq.last.6mths delinq.2yrs pub.rec credit.policy.F           purposeF
## 1              0           0       0         Fulfill debt_consolidation
## 2              0           0       0         Fulfill        credit_card
## 3              1           0       0         Fulfill debt_consolidation
## 4              1           0       0         Fulfill debt_consolidation
## 5              0           1       0         Fulfill        credit_card
## 6              0           0       0         Fulfill        credit_card
##   not.fully.paidF revol.balN
## 1            Paid      28854
## 2            Paid      33623
## 3            Paid       3511
## 4            Paid      33667
## 5            Paid       4740
## 6            Paid      50807
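
The five `round()` calls above can also be written as one step over a vector of column names. A minimal sketch on a hypothetical toy data frame (the values are illustrative only):

```r
# Hypothetical toy data with two numeric columns and one text column
toy <- data.frame(
  int.rate    = c(0.1189, 0.1071),
  installment = c(829.104, 228.221),
  purpose     = c("credit_card", "educational")
)

# Round only the listed numeric columns; other columns are left untouched
num_cols <- c("int.rate", "installment")
toy[num_cols] <- lapply(toy[num_cols], round, digits = 2)

toy
```

Note that rounding a proportion such as int.rate to two decimals changes the data itself (0.1189 becomes 0.12), so for analysis it may be safer to round only when printing.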

###to find missing values. If there are any, we can remove them with the function drop_na() from the tidyr package

find_missing_values <- function(data) {
  missing <- sum(is.na(data))
  if(missing > 0) {
    cat("Number of missing values in the dataset:", missing, "\n")
    cat("Indices of missing values:\n")
    print(which(is.na(data), arr.ind = TRUE))
  } else {
    cat("No missing values found in the dataset.\n")
  }
}


find_missing_values(data)

## No missing values found in the dataset.

###however, as we can see, there are no missing values here
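
If missing values were present, base R could also count them per column and keep only the complete rows, without any extra package. A minimal sketch on a hypothetical toy data frame with two NAs inserted on purpose:

```r
# Hypothetical toy data with one missing value in each column
toy <- data.frame(
  dti  = c(19.48, NA, 11.63, 8.10),
  fico = c(737, 707, NA, 712)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Keep only rows without any missing value (base R alternative to drop_na)
complete_rows <- toy[complete.cases(toy), ]

na_per_column
complete_rows
```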

summary(data)  ###descriptive statistic for all variables

##     int.rate       installment     log.annual.inc       dti        
##  Min.   :0.0600   Min.   : 15.67   Min.   : 7.55   Min.   : 0.000  
##  1st Qu.:0.1000   1st Qu.:163.77   1st Qu.:10.56   1st Qu.: 7.213  
##  Median :0.1200   Median :268.95   Median :10.93   Median :12.665  
##  Mean   :0.1228   Mean   :319.09   Mean   :10.93   Mean   :12.607  
##  3rd Qu.:0.1400   3rd Qu.:432.76   3rd Qu.:11.29   3rd Qu.:17.950  
##  Max.   :0.2200   Max.   :940.14   Max.   :14.53   Max.   :29.960  
##                                                                    
##  days.with.cr.line   revol.util    inq.last.6mths    delinq.2yrs     
##  Min.   :  179     Min.   :  0.0   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 2820     1st Qu.: 22.6   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median : 4140     Median : 46.3   Median : 1.000   Median : 0.0000  
##  Mean   : 4561     Mean   : 46.8   Mean   : 1.577   Mean   : 0.1637  
##  3rd Qu.: 5730     3rd Qu.: 70.9   3rd Qu.: 2.000   3rd Qu.: 0.0000  
##  Max.   :17640     Max.   :119.0   Max.   :33.000   Max.   :13.0000  
##                                                                      
##     pub.rec         credit.policy.F               purposeF    not.fully.paidF
##  Min.   :0.00000   Fulfill  :7710   credit_card       :1262   Not Paid:1533  
##  1st Qu.:0.00000   Otherwise:1868   debt_consolidation:3957   Paid    :8045  
##  Median :0.00000                    educational       : 343                  
##  Mean   :0.06212                    major_purchase    : 437                  
##  3rd Qu.:0.00000                    small_business    : 619                  
##  Max.   :5.00000                    all_other         :2331                  
##                                     home_improvement  : 629                  
##    revol.balN     
##  Min.   :      0  
##  1st Qu.:   3187  
##  Median :   8596  
##  Mean   :  16914  
##  3rd Qu.:  18250  
##  Max.   :1207359  
##

summary(data [   , c(-10, -11, -12) ]) ###descriptive statistics without categorical variable

##     int.rate       installment     log.annual.inc       dti        
##  Min.   :0.0600   Min.   : 15.67   Min.   : 7.55   Min.   : 0.000  
##  1st Qu.:0.1000   1st Qu.:163.77   1st Qu.:10.56   1st Qu.: 7.213  
##  Median :0.1200   Median :268.95   Median :10.93   Median :12.665  
##  Mean   :0.1228   Mean   :319.09   Mean   :10.93   Mean   :12.607  
##  3rd Qu.:0.1400   3rd Qu.:432.76   3rd Qu.:11.29   3rd Qu.:17.950  
##  Max.   :0.2200   Max.   :940.14   Max.   :14.53   Max.   :29.960  
##  days.with.cr.line   revol.util    inq.last.6mths    delinq.2yrs     
##  Min.   :  179     Min.   :  0.0   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 2820     1st Qu.: 22.6   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median : 4140     Median : 46.3   Median : 1.000   Median : 0.0000  
##  Mean   : 4561     Mean   : 46.8   Mean   : 1.577   Mean   : 0.1637  
##  3rd Qu.: 5730     3rd Qu.: 70.9   3rd Qu.: 2.000   3rd Qu.: 0.0000  
##  Max.   :17640     Max.   :119.0   Max.   :33.000   Max.   :13.0000  
##     pub.rec          revol.balN     
##  Min.   :0.00000   Min.   :      0  
##  1st Qu.:0.00000   1st Qu.:   3187  
##  Median :0.00000   Median :   8596  
##  Mean   :0.06212   Mean   :  16914  
##  3rd Qu.:0.00000   3rd Qu.:  18250  
##  Max.   :5.00000   Max.   :1207359

I will not discuss every variable here; I will interpret just two variables of my choice.

The lowest interest rate in this sample was 6%, the highest was 22%. 1st Qu means that a quarter of borrowers had interest rates of 10% or lower, median means that half of borrowers had interest rates of 12% or lower, and 3rd Qu means that 75% of borrowers had interest rates of 14% or lower; respectively, the last quarter of borrowers had interest rates above 14%. “mean” means that the arithmetic average value of the interest rate is 12.28%.

days.with.cr.line is the number of days the borrower has had a credit line. The minimum is 179 days, the maximum is 17,640 days. The mean is the sum of all values divided by the number of observations; here it shows that, on average, borrowers have had a credit line for 4561 days. 1st Qu means that a quarter of borrowers have had a credit line for 2820 days or less. The median means that half of the borrowers have had a credit line for 4,140 days or less. 3rd Qu means that 75% of borrowers have had a credit line for 5730 days or less.

###another way to calculate the desired values from descriptive statistics is to use the formulas below. I will calculate only for the borrower's revolving balance variable. 

min(data$revol.balN)

## [1] 0

max(data$revol.balN)

## [1] 1207359

mean(data$revol.balN)

## [1] 16913.96

median(data$revol.balN)

## [1] 8596

sd(data$revol.balN)

## [1] 33756.19

var(data$revol.balN)

## [1] 1139480333

range(data$revol.balN)

## [1]       0 1207359

sum(data$revol.balN)

## [1] 162001946

quantile(data$revol.balN)

##        0%       25%       50%       75%      100% 
##       0.0    3187.0    8596.0   18249.5 1207359.0

quantile(data$revol.balN, probs = 0.10)

##   10% 
## 710.7

min: the smallest revolving balance is $0 (i.e. everything is paid)

max: the largest revolving balance is $1,207,359

mean: the average revolving balance of a borrower in the sample is $16,913.96.

median: half of the borrowers have a revolving balance (amount unpaid at the end of the credit card billing cycle) of $8,596 or less.

sd: the values are, on average, about $33,756.19 away from the mean.

range: range() returns the smallest and the largest value, $0 and $1,207,359. The difference between them, i.e. the spread of the revolving balance across borrowers, is $1,207,359.

sum: the total revolving balance of all borrowers in the sample is $162,001,946

quantile: A quarter of borrowers have a revolving balance of $3,187 or less. Half of the borrowers have a balance of $8,596 or less. And 75% of borrowers have a balance of $18,249.5 or less

quantile, 10%: This means that 10% of borrowers have a revolving balance of $710.7 or less.
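
The reading "p% of borrowers have this value or less" can be checked directly by counting. A minimal sketch, assuming a hypothetical vector of ten balances (not taken from the data set):

```r
# Hypothetical revolving balances in USD
balance <- c(0, 500, 1200, 3000, 4500, 8000, 9000, 15000, 20000, 120000)

# Cut-offs for the 10%, 25%, 50% and 75% quantiles
q <- quantile(balance, probs = c(0.10, 0.25, 0.50, 0.75))

# Share of observations at or below each cut-off
share_at_or_below <- sapply(q, function(cutoff) mean(balance <= cutoff))

q
share_at_or_below
```

With only ten observations the shares are close to, but not exactly, the nominal probabilities, because `quantile()` interpolates between neighbouring values by default. In a sample of 9578 the difference is negligible.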

#install.packages("pastecs")

library(pastecs)

round(stat.desc(data[   , c(-10, -11, -12) ]), 2)

##              int.rate installment log.annual.inc       dti days.with.cr.line
## nbr.val       9578.00     9578.00        9578.00   9578.00           9578.00
## nbr.null         0.00        0.00           0.00     89.00              0.00
## nbr.na           0.00        0.00           0.00      0.00              0.00
## min              0.06       15.67           7.55      0.00            178.96
## max              0.22      940.14          14.53     29.96          17639.96
## range            0.16      924.47           6.98     29.96          17461.00
## sum           1175.73  3056238.40      104710.00 120746.77       43683027.92
## median           0.12      268.95          10.93     12.66           4139.96
## mean             0.12      319.09          10.93     12.61           4560.77
## SE.mean          0.00        2.12           0.01      0.07             25.51
## CI.mean.0.95     0.00        4.15           0.01      0.14             50.01
## var              0.00    42878.52           0.38     47.39        6234661.48
## std.dev          0.03      207.07           0.61      6.88           2496.93
## coef.var         0.22        0.65           0.06      0.55              0.55
##              revol.util inq.last.6mths delinq.2yrs pub.rec   revol.balN
## nbr.val         9578.00        9578.00     9578.00 9578.00 9.578000e+03
## nbr.null         297.00        3637.00     8458.00 9019.00 3.210000e+02
## nbr.na             0.00           0.00        0.00    0.00 0.000000e+00
## min                0.00           0.00        0.00    0.00 0.000000e+00
## max              119.00          33.00       13.00    5.00 1.207359e+06
## range            119.00          33.00       13.00    5.00 1.207359e+06
## sum           448243.08       15109.00     1568.00  595.00 1.620019e+08
## median            46.30           1.00        0.00    0.00 8.596000e+03
## mean              46.80           1.58        0.16    0.06 1.691396e+04
## SE.mean            0.30           0.02        0.01    0.00 3.449200e+02
## CI.mean.0.95       0.58           0.04        0.01    0.01 6.761100e+02
## var              841.84           4.84        0.30    0.07 1.139480e+09
## std.dev           29.01           2.20        0.55    0.26 3.375619e+04
## coef.var           0.62           1.39        3.34    4.22 2.000000e+00

###The stat.desc function also calculates basic statistics. round is needed to round the value to two decimal places so that the results look better. I also removed categorical variables, because their statistical description does not make any sense

Here, too, I will explain only one variable - installment.

nbr.val 9578.00 is the number of observations that have a value, that is, which are not missing. Zeros are counted: for dti, nbr.val is 9578 although 89 values are zero. In this particular case, it is all 9578 observations

nbr.null 0.00 and nbr.na 0.00 - this means that we have neither zeros nor missing data in the installment data

min 15.67 - that is, the minimum monthly payment is $15.67

max 940.14 - this means that the borrower’s maximum monthly payment is $940.14

range 924.47 means that the difference between the largest and smallest monthly payments of the sample is $924.47

sum 3056238.40 - that is, all members of the sample taken together must pay $3,056,238.40 per month.

median 268.95 means that half of borrowers pay $268.95 or less each month.

mean 319.09 means that on average the borrower pays $319.09 per month
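
The rows SE.mean, CI.mean.0.95 and coef.var are not explained above. A minimal sketch of how they can be computed by hand, assuming stat.desc uses the usual textbook definitions (standard error = sd / sqrt(n), confidence half-width from the t distribution, coefficient of variation = sd / mean). The vector holds the six installment values printed by the first head() call, so the numbers differ from the full-sample table:

```r
# The six installment values from the first head() output
x <- c(829.10, 228.22, 366.86, 162.34, 102.92, 125.13)

n <- length(x)

# Standard error of the mean
se_mean <- sd(x) / sqrt(n)

# Half-width of the 95% confidence interval for the mean (assumed definition)
ci_half <- qt(0.975, df = n - 1) * se_mean

# Coefficient of variation: standard deviation relative to the mean
coef_var <- sd(x) / mean(x)

c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var)
```

These definitions agree with the installment column of the table: 207.07 / sqrt(9578) is about 2.12, and 207.07 / 319.09 is about 0.65.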

library(psych)

describe(data [   , c(-10, -11, -12) ])   ###another function to calculate statistics, without categorical variables as well

##                   vars    n     mean       sd  median  trimmed     mad    min
## int.rate             1 9578     0.12     0.03    0.12     0.12    0.03   0.06
## installment          2 9578   319.09   207.07  268.95   295.64  184.88  15.67
## log.annual.inc       3 9578    10.93     0.61   10.93    10.93    0.55   7.55
## dti                  4 9578    12.61     6.88   12.66    12.59    7.98   0.00
## days.with.cr.line    5 9578  4560.77  2496.93 4139.96  4303.64 2135.06 178.96
## revol.util           6 9578    46.80    29.01   46.30    46.50   35.88   0.00
## inq.last.6mths       7 9578     1.58     2.20    1.00     1.16    1.48   0.00
## delinq.2yrs          8 9578     0.16     0.55    0.00     0.02    0.00   0.00
## pub.rec              9 9578     0.06     0.26    0.00     0.00    0.00   0.00
## revol.balN          10 9578 16913.96 33756.19 8596.00 10809.19 9619.11   0.00
##                          max      range  skew kurtosis     se
## int.rate                0.22       0.16  0.17    -0.21   0.00
## installment           940.14     924.47  0.91     0.14   2.12
## log.annual.inc         14.53       6.98  0.03     1.61   0.01
## dti                    29.96      29.96  0.02    -0.90   0.07
## days.with.cr.line   17639.96   17461.00  1.16     1.94  25.51
## revol.util            119.00     119.00  0.06    -1.12   0.30
## inq.last.6mths         33.00      33.00  3.58    26.27   0.02
## delinq.2yrs            13.00      13.00  6.06    71.38   0.01
## pub.rec                 5.00       5.00  5.12    38.75   0.00
## revol.balN        1207359.00 1207359.00 11.16   259.46 344.92

here I propose to analyze the pub.rec variable, i.e. The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).

This is variable (var) number 9 in the output table.

n means the number of observations, which is 9578 for every variable

mean, as always, means the arithmetic mean, that is, on average, borrowers have 0.06 public records each.

sd stands for standard deviation, i.e. the values are, on average, about 0.26 units (number of records) away from the mean

median of 0.00 means that at least half of the borrowers have 0 public records

trimmed is the trimmed mean: the mean calculated after a certain proportion of the data points has been removed from both ends of the distribution (10% from each end by default in describe). This is done to reduce the influence of extreme values (outliers). Here the trimmed mean is 0.00 while the ordinary mean is 0.06, so the mean is pulled up by the relatively few borrowers who have public records.

the minimum and maximum values show the smallest and largest number of public records a borrower has, in this case 0 and 5, respectively. The range shows the spread of the data; it is max-min=5-0=5 units.

skewness (skew) means asymmetry of the distribution to the left or to the right of the mean. In this case there is a positive skew (5.12), which means there is a longer “tail” on the right-hand side.

kurtosis is a measure of the tailedness or peakedness of a distribution. In our case the value is large and positive (38.75), which means heavier tails and a sharper peak compared to a normal distribution
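
To make the trimmed mean, skewness and kurtosis more concrete, here is a minimal sketch that computes them by hand. It assumes the plain moment-based definitions; `moment_skew` and `excess_kurtosis` are hypothetical helper names, and describe may apply a slightly different small-sample correction, so its values can differ a little. The `records` vector is made up to look like pub.rec (mostly zeros, a few larger values):

```r
# Hypothetical counts of public records: mostly zeros, one large value
records <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 5)

# Ordinary mean versus mean after dropping 10% of values from each end
plain_mean   <- mean(records)
trimmed_mean <- mean(records, trim = 0.1)

# Moment-based skewness: third central moment scaled by variance^(3/2)
moment_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment scaled by variance^2,
# minus 3 so that a normal distribution scores about 0
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

c(mean = plain_mean, trimmed = trimmed_mean,
  skew = moment_skew(records), kurtosis = excess_kurtosis(records))
```

The single value of 5 raises the ordinary mean well above the trimmed mean and produces a positive skew, which is the same pattern as in the pub.rec row.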

describeBy(data$dti, group = data$purposeF)   ###the original purpose column was deleted above, so the factor purposeF is used

## 
##  Descriptive statistics by group 
## group: credit_card
##    vars    n mean   sd median trimmed  mad min   max range  skew kurtosis   se
## X1    1 1262 14.1 6.47  14.38   14.21 7.46   0 29.95 29.95 -0.11     -0.8 0.18
## ------------------------------------------------------------ 
## group: debt_consolidation
##    vars    n  mean   sd median trimmed  mad min   max range  skew kurtosis  se
## X1    1 3957 14.08 6.43  14.24   14.18 7.35   0 29.96 29.96 -0.09    -0.77 0.1
## ------------------------------------------------------------ 
## group: educational
##    vars   n  mean   sd median trimmed  mad min   max range skew kurtosis   se
## X1    1 343 11.34 6.94  11.42   11.15 8.29   0 29.74 29.74 0.19    -0.87 0.37
## ------------------------------------------------------------ 
## group: major_purchase
##    vars   n  mean   sd median trimmed  mad min   max range skew kurtosis   se
## X1    1 437 10.16 6.63   9.51    9.91 7.78   0 26.15 26.15 0.26    -0.96 0.32
## ------------------------------------------------------------ 
## group: small_business
##    vars   n  mean   sd median trimmed  mad min   max range skew kurtosis   se
## X1    1 619 10.79 6.93  10.39   10.53 8.09   0 29.21 29.21 0.26    -0.92 0.28
## ------------------------------------------------------------ 
## group: all_other
##    vars    n  mean  sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 2331 11.08 7.1  10.56   10.84 8.51   0 29.9  29.9 0.23    -0.92 0.15
## ------------------------------------------------------------ 
## group: home_improvement
##    vars   n mean   sd median trimmed  mad min   max range skew kurtosis   se
## X1    1 629 10.2 6.78   9.66    9.82 8.04   0 28.17 28.17 0.38    -0.75 0.27

This function calculated the statistics for the variable “dti” separately for each category of the categorical variable “purpose”.

It is more convenient to compare these data in the knitted document, but the output has the same format either way.

So, we can see that, for example, the most popular purpose is debt_consolidation, with 3957 observations in this category, followed by all_other and credit_card with 2331 and 1262 observations, respectively. The least popular purpose, on the other hand, is educational, with only 343 observations.

The average debt-to-income ratio also varies by category. For credit_card, for example, the mean ratio is 14.1. Since the values in this data set range from 0 to about 30, they are presumably expressed in percent, meaning that, on average, the debt of those who borrow for a credit card is about 14.1% of their income (and not 14.1 times their income). This, by the way, is also the highest mean ratio (debt_consolidation follows closely with 14.08). For those who borrow for educational purposes, the mean ratio is 11.34. The mean has the lowest value for major_purchase and home_improvement: 10.16 and 10.20, respectively.

The two highest median values are 14.38 and 14.24 for credit_card and debt_consolidation, respectively, which means that half of the people who borrow for credit cards have a debt-to-income ratio of 14.38 or less, and half of those who borrow for debt consolidation have a debt-to-income ratio of 14.24 or less. By “less” I mean that there is less debt relative to income.
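
The group sizes, means and medians discussed above can also be reproduced in base R with `tapply()`, which applies a function to each group separately. A minimal sketch on a hypothetical toy data frame with two purposes (the values are made up):

```r
# Hypothetical toy data: dti values for two loan purposes
toy <- data.frame(
  dti     = c(14.0, 16.0, 10.0, 12.0, 9.0),
  purpose = factor(c("credit_card", "credit_card",
                     "educational", "educational", "educational"))
)

# One statistic per group
group_n      <- tapply(toy$dti, toy$purpose, length)
group_mean   <- tapply(toy$dti, toy$purpose, mean)
group_median <- tapply(toy$dti, toy$purpose, median)

# Collect the results in one table, one row per purpose
summary_tbl <- data.frame(
  n         = as.vector(group_n),
  mean      = as.vector(group_mean),
  median    = as.vector(group_median),
  row.names = names(group_n)
)

summary_tbl
```

This gives a compact side-by-side table, which can be easier to compare across groups than the separate blocks printed by describeBy.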

describeBy(data$revol.balN, group = data$credit.policy.F)

## 
##  Descriptive statistics by group 
## group: Fulfill
##    vars    n    mean       sd median trimmed     mad min    max  range skew
## X1    1 7710 13798.4 16878.56 8707.5 10514.1 9423.41   0 149527 149527 2.97
##    kurtosis     se
## X1    12.54 192.22
## ------------------------------------------------------------ 
## group: Otherwise
##    vars    n     mean       sd median  trimmed      mad min     max   range
## X1    1 1868 29773.15 66807.57 8039.5 14032.46 10343.36   0 1207359 1207359
##    skew kurtosis      se
## X1 6.64    79.16 1545.74

So here is another example of how to calculate statistics for each category separately. Here such statistics are calculated for the quantitative variable “revol.balN”, which is the borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle). It is grouped by whether the customer meets the credit underwriting criteria of the platform or not.

Let’s analyze here the values that I did not analyze in the previous example.

The minimum and maximum revolving balance for those who meet the criteria are 0 USD and 149,527 USD, respectively, while for those who do not meet the criteria the minimum and maximum values are 0 USD and 1,207,359 USD, respectively.

Both skewnesses have a positive value: 2.97 for those who meet the criteria and 6.64 for those who do not meet the criteria. Positive skew means that the right “tail” is longer.

Kurtosis is positive in both cases, so both distributions have heavier tails and a sharper peak than a normal distribution. Moreover, for those who do not meet the criteria the value is larger (79.16 vs 12.54), that is, their distribution has even heavier tails (more extreme values).

Independent Sample Test

In order to test units that belongs to two different populations. Each unit in this case in measured once. Each unit belongs to only one population, the samples do not overlap

the independent t-test will be used here, because there are only two categories, and the categories are independent, that is, the populations do not intersect with each other

Research question: whether there is any difference in arithmetic mean in borrower’s revolving balance for those borrowers meets the credit criteria and for those who dont

For this, we selected a sample of 9578 units of observation, which include users (borrowers) of the platform LendingClub.com (platform where private borrowers and lenders meet). One unit of observation is one borrower.

Assumptions that must hold:

analysed variable must be numeric.
distribution of the variable is normal in both populations.
data must come from two independent populations.
variable has the same variance in both populations. Welch correction otherwise.
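The last assumption mentions the Welch correction. As a minimal sketch on simulated data (the vectors `g1` and `g2` are hypothetical and not taken from the loan data), this is how the classic and the Welch version of the independent t-test are requested in base R:

```r
set.seed(2)
g1 <- rnorm(200, mean = 10,   sd = 1)   # hypothetical group with small variance
g2 <- rnorm(50,  mean = 10.5, sd = 5)   # hypothetical group with large variance

student <- t.test(g1, g2, var.equal = TRUE)    # assumes equal variances
welch   <- t.test(g1, g2, var.equal = FALSE)   # Welch correction (the default)

student$parameter   # degrees of freedom: n1 + n2 - 2
welch$parameter     # adjusted (smaller) degrees of freedom
```

The Welch version lowers the degrees of freedom when the variances differ, which makes the test more cautious.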

To answer our research question we need the following two variables:

revol.balN: borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle, in $), numerical
credit.policy.F: whether the customer meets the credit underwriting criteria or not, categorical, was turned into a factor earlier

library(psych)

describeBy(data$revol.balN, data$credit.policy.F)   ###function that helps to make descriptive statistics for each factor variable group separately

## 
##  Descriptive statistics by group 
## group: Fulfill
##    vars    n    mean       sd median trimmed     mad min    max  range skew
## X1    1 7710 13798.4 16878.56 8707.5 10514.1 9423.41   0 149527 149527 2.97
##    kurtosis     se
## X1    12.54 192.22
## ------------------------------------------------------------ 
## group: Otherwise
##    vars    n     mean       sd median  trimmed      mad min     max   range
## X1    1 1868 29773.15 66807.57 8039.5 14032.46 10343.36   0 1207359 1207359
##    skew kurtosis      se
## X1 6.64    79.16 1545.74

Here are the basic descriptive statistics for the revolving balance by category. As we can see, the average revolving balance for those who meet the criteria of the credit policy is 13798.4, which is about half the average balance of those who do not meet the criteria (29773.15). The medians, however, are close to each other (8707.5 vs 8039.5).

Informally, judging by the means, there appears to be a difference, and a large one.

However, since we are evaluating a sample rather than the entire population, we must perform formal tests before drawing any real conclusions.

The observation units in our sample belong to two different populations. That is, a borrower either meets the criteria or does not, and cannot be in both categories at the same time. Each unit in the sample was measured only once. Since these are two different populations, we choose an independent t-test.

Let’s recall our assumptions once again and test whether they hold.

Numeric variable. Yes, revolving balance is numeric, expressed in $
Distribution of the variable is normal in both populations. To be tested now

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

Fulfill <- ggplot(data[data$credit.policy.F == "Fulfill",  ], aes(x = revol.balN)) +    ###add dataframe, variable for x-axis and factor to group by
  theme_linedraw() +    ###change theme
  geom_histogram() +    ###type of the graph
  labs(title = "Fulfill Criteria",   ###add title
       x = "Revolving Balance",   ###name x-axis
       y = "Frequency")    ###name y-axis

Not_Fulfill <- ggplot(data[data$credit.policy.F == "Otherwise",  ], aes(x = revol.balN)) +  ###add dataframe, variable for x-axis and factor to group by
  theme_linedraw() +     ###change theme
  geom_histogram() +     ###type of the graph
  labs(title = "Do Not Fulfill Criteria",   ###add title
       x = "Revolving Balance",   ###name x-axis
       y = "Frequency")    ###name y-axis

library(ggpubr)
ggarrange(Fulfill, Not_Fulfill,     ###group two graphs into one picture
          ncol = 2, nrow = 1)      ###2 columns, 1 row, means place them side by side horizontally

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Our sample is really large, so it is possible to draw conclusions from the graphs alone. As we can see, this is not a normal distribution in either case. Both graphs are very strongly right-skewed, with the smallest values having the highest frequency.

Although the Shapiro-Wilk test is not strictly needed here, let me do it just for practice.

The Shapiro-Wilk test can only be performed on sample sizes from 3 to 5000. My sample is almost twice as large, so it has to be reduced. To ensure unbiased selection of observation units, we use the following R command to select 4500 units at random.

small_sample <- data[sample(nrow(data), 4500), ]    

summary(small_sample)

##     int.rate       installment     log.annual.inc       dti        
##  Min.   :0.0600   Min.   : 15.69   Min.   : 7.60   Min.   : 0.000  
##  1st Qu.:0.1000   1st Qu.:163.14   1st Qu.:10.57   1st Qu.: 7.185  
##  Median :0.1200   Median :268.49   Median :10.93   Median :12.620  
##  Mean   :0.1226   Mean   :316.88   Mean   :10.94   Mean   :12.572  
##  3rd Qu.:0.1400   3rd Qu.:421.89   3rd Qu.:11.29   3rd Qu.:17.850  
##  Max.   :0.2200   Max.   :940.14   Max.   :14.53   Max.   :29.740  
##                                                                    
##  days.with.cr.line   revol.util     inq.last.6mths   delinq.2yrs     
##  Min.   :  180     Min.   :  0.00   Min.   : 0.00   Min.   : 0.0000  
##  1st Qu.: 2820     1st Qu.: 22.68   1st Qu.: 0.00   1st Qu.: 0.0000  
##  Median : 4140     Median : 46.30   Median : 1.00   Median : 0.0000  
##  Mean   : 4544     Mean   : 46.83   Mean   : 1.58   Mean   : 0.1649  
##  3rd Qu.: 5700     3rd Qu.: 70.70   3rd Qu.: 2.00   3rd Qu.: 0.0000  
##  Max.   :17616     Max.   :106.50   Max.   :28.00   Max.   :11.0000  
##                                                                      
##     pub.rec         credit.policy.F               purposeF    not.fully.paidF
##  Min.   :0.00000   Fulfill  :3616   credit_card       : 584   Not Paid: 731  
##  1st Qu.:0.00000   Otherwise: 884   debt_consolidation:1857   Paid    :3769  
##  Median :0.00000                    educational       : 159                  
##  Mean   :0.05844                    major_purchase    : 218                  
##  3rd Qu.:0.00000                    small_business    : 271                  
##  Max.   :3.00000                    all_other         :1113                  
##                                     home_improvement  : 298                  
##    revol.balN    
##  Min.   :     0  
##  1st Qu.:  3124  
##  Median :  8670  
##  Mean   : 16873  
##  3rd Qu.: 18550  
##  Max.   :952013  
##

summary(data)

##     int.rate       installment     log.annual.inc       dti        
##  Min.   :0.0600   Min.   : 15.67   Min.   : 7.55   Min.   : 0.000  
##  1st Qu.:0.1000   1st Qu.:163.77   1st Qu.:10.56   1st Qu.: 7.213  
##  Median :0.1200   Median :268.95   Median :10.93   Median :12.665  
##  Mean   :0.1228   Mean   :319.09   Mean   :10.93   Mean   :12.607  
##  3rd Qu.:0.1400   3rd Qu.:432.76   3rd Qu.:11.29   3rd Qu.:17.950  
##  Max.   :0.2200   Max.   :940.14   Max.   :14.53   Max.   :29.960  
##                                                                    
##  days.with.cr.line   revol.util    inq.last.6mths    delinq.2yrs     
##  Min.   :  179     Min.   :  0.0   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 2820     1st Qu.: 22.6   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median : 4140     Median : 46.3   Median : 1.000   Median : 0.0000  
##  Mean   : 4561     Mean   : 46.8   Mean   : 1.577   Mean   : 0.1637  
##  3rd Qu.: 5730     3rd Qu.: 70.9   3rd Qu.: 2.000   3rd Qu.: 0.0000  
##  Max.   :17640     Max.   :119.0   Max.   :33.000   Max.   :13.0000  
##                                                                      
##     pub.rec         credit.policy.F               purposeF    not.fully.paidF
##  Min.   :0.00000   Fulfill  :7710   credit_card       :1262   Not Paid:1533  
##  1st Qu.:0.00000   Otherwise:1868   debt_consolidation:3957   Paid    :8045  
##  Median :0.00000                    educational       : 343                  
##  Mean   :0.06212                    major_purchase    : 437                  
##  3rd Qu.:0.00000                    small_business    : 619                  
##  Max.   :5.00000                    all_other         :2331                  
##                                     home_improvement  : 629                  
##    revol.balN     
##  Min.   :      0  
##  1st Qu.:   3187  
##  Median :   8596  
##  Mean   :  16914  
##  3rd Qu.:  18250  
##  Max.   :1207359  
##

With the summary command I compute basic descriptive statistics and compare the values of the full sample and the reduced sample. The observations were selected at random, so the selection is unbiased. The reduced sample shows some small deviations, but they will not harm further analysis.
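One thing to keep in mind is that sample() draws a different subset on every run, so the numbers for the reduced sample change each time the document is knitted. A minimal sketch of how the draw could be made reproducible with set.seed() is shown below. The data frame `toy` and the seed value 123 are assumptions for illustration only.

```r
# hypothetical stand-in for the loan data
toy <- data.frame(id = 1:100, value = seq(10, 1000, by = 10))

set.seed(123)                           # fix the random number generator
sub1 <- toy[sample(nrow(toy), 20), ]    # first draw of 20 rows

set.seed(123)                           # same seed again
sub2 <- toy[sample(nrow(toy), 20), ]    # second draw returns the same rows

identical(sub1, sub2)
```

With the same seed both draws return exactly the same rows, so descriptive statistics and test results for the subsample stay the same between runs.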

Now let’s do the Shapiro-Wilk test for the reduced sample

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:pastecs':
## 
##     first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

small_sample <- data[sample(nrow(data), 4500), ]    ###note: this draws a new random sample of 4500 rows, replacing the one summarised above

small_sample %>%
  group_by(credit.policy.F) %>%
  shapiro_test(revol.balN)

## # A tibble: 2 × 4
##   credit.policy.F variable   statistic        p
##   <fct>           <chr>          <dbl>    <dbl>
## 1 Fulfill         revol.balN     0.697 1.08e-62
## 2 Otherwise       revol.balN     0.418 1.71e-46

Let’s set the hypotheses for the Shapiro-Wilk test

H0: data is normally distributed

H1: data is not normally distributed

As we conduct the test for two groups, we have to evaluate the hypotheses for each group separately.

For the group “Fulfill”, the p-value is 1.08e-62, so we reject H0 at p < 0.001. Hence the data are not normally distributed.

For the group “Otherwise”, the p-value is 1.71e-46, so we reject H0 at p < 0.001. Hence the data are not normally distributed.

This was the second confirmation that the distribution of the variable is not normal. Because the data do not satisfy the normality assumption, we have to conduct a non-parametric test, namely the Wilcoxon rank sum test.

let’s set our hypothesis for the Wilcoxon rank sum test:

H0: locations of distributions of the borrower’s revolving balance are the same for those who fulfill credit policy and for those who don’t

H1: locations of distributions are not the same

wilcox.test(data$revol.balN ~ data$credit.policy.F,    ###add tested variable and factor variable 
            paired = FALSE,    ###because the samples are independent, not paired
            correct = FALSE,    ###no continuity correction
            exact = FALSE,    ###use the normal approximation for the p-value
            alternative = "two.sided")    ###because we check whether the locations are the same or not. to test whether one of them is bigger or smaller, write "greater" or "less"

## 
##  Wilcoxon rank sum test
## 
## data:  data$revol.balN by data$credit.policy.F
## W = 7104372, p-value = 0.3668
## alternative hypothesis: true location shift is not equal to 0

p-value = 0.3668, which is greater than 0.05, so we cannot reject H0.

Therefore, we cannot rule out that the locations of distributions of the borrower’s revolving balance are the same for those who fulfill credit policy and for those who don’t
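To show where the W statistic comes from, here is a minimal sketch on simulated data (the vectors `x` and `y` are hypothetical). All observations are ranked together, the ranks of the first group are summed, and the smallest possible rank sum n1(n1 + 1)/2 is subtracted.

```r
set.seed(3)
x <- rexp(30, rate = 1)     # hypothetical group 1
y <- rexp(40, rate = 0.8)   # hypothetical group 2

r  <- rank(c(x, y))         # joint ranks of both groups
n1 <- length(x)

# rank sum of group 1 minus its minimum possible value
W_manual <- sum(r[seq_along(x)]) - n1 * (n1 + 1) / 2

# the same statistic as reported by wilcox.test
W_r <- wilcox.test(x, y, exact = FALSE, correct = FALSE)$statistic

all.equal(unname(W_r), W_manual)
```

Because the test works with ranks only, a few extremely large balances (like the maximum of 1,207,359) influence it much less than they influence the mean.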

In order to estimate how strongly compliance or non-compliance with the criteria is related to the revolving balance, let’s compute the effect size

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize(wilcox.test(data$revol.balN ~ data$credit.policy.F,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## -0.01             | [-0.04, 0.02]

In absolute terms the effect size is equal to 0.01.

If we want to interpret it manually, then 0.01 < 0.05, so the effect size is tiny.

one can also use R to determine how large the effect size is. for this we will use the following function

interpret_rank_biserial(0.01)

## [1] "tiny"
## (Rules: funder2019)

therefore, it is confirmed, effect size is tiny
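As a cross-check, the rank-biserial value can be approximately reproduced from the test output alone. This is a sketch under the assumption that the relation r = 2W / (n1 * n2) - 1 applies here, with W taken from the Wilcoxon output and the group sizes from the descriptive statistics. The effectsize package itself works from the raw data.

```r
W  <- 7104372   # W from the Wilcoxon rank sum test output
n1 <- 7710      # number of borrowers in group "Fulfill"
n2 <- 1868      # number of borrowers in group "Otherwise"

# share of pairs "won" by the first group, rescaled to the range [-1, 1]
r_rb <- 2 * W / (n1 * n2) - 1

round(r_rb, 2)
```

The result rounds to -0.01, which agrees with the value reported by effectsize().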

Conclusion: We cannot conclude that the locations of distributions of the borrower’s revolving balance differ between those who fulfill the credit policy and those who don’t (we cannot reject H0, as p-value = 0.3668 > 0.05), so we can’t say that the medians are different. Compliance with the credit policy has only a tiny effect (0.01 in absolute value) on the amount of the borrower’s revolving balance.

ANOVA test

Here we will compare arithmetic means of independent samples (meaning that observation units from different populations do not overlap). This is an extension of the independent sample t-test.

ANOVA will be used because the categories are independent, that is, the populations do not overlap with each other, and there are more than two such categories.

Research question: is there a difference in the amount of installment between groups of borrowers with different purposes for borrowing?

The following variables are needed to investigate the research question:

installment: The monthly installments owed by the borrower if the loan is funded in $, numerical.
purpose: The purpose of the loan, categorical, nominal.

as the samples do not overlap, that is, one borrower has only one purpose of the loan, and there are more than two categories, we will conduct ANOVA test

let’s set assumptions:

Analyzed variable is numeric. the condition is true because the installment is measured in dollars and is numeric
Variable in the population is normally distributed within each group. Use non-parametric test, if violated
Homoscedasticity: the variance of the analyzed variable is the same within all groups. If this is violated, the test is still parametric but with Welch correction

library(psych)

describeBy(x = data$installment, group = data$purposeF)

## 
##  Descriptive statistics by group 
## group: credit_card
##    vars    n  mean     sd median trimmed    mad   min    max  range skew
## X1    1 1262 319.5 198.23 266.67  296.87 169.88 16.73 922.42 905.69 0.95
##    kurtosis   se
## X1     0.27 5.58
## ------------------------------------------------------------ 
## group: debt_consolidation
##    vars    n   mean     sd median trimmed    mad   min    max  range skew
## X1    1 3957 358.98 198.31 325.08  342.16 203.21 23.21 940.14 916.93  0.7
##    kurtosis   se
## X1    -0.17 3.15
## ------------------------------------------------------------ 
## group: educational
##    vars   n   mean     sd median trimmed    mad   min    max  range skew
## X1    1 343 217.55 168.51 169.62  190.44 125.74 15.67 861.88 846.21 1.59
##    kurtosis  se
## X1     2.65 9.1
## ------------------------------------------------------------ 
## group: major_purchase
##    vars   n   mean     sd median trimmed    mad   min    max  range skew
## X1    1 437 243.48 179.32 198.78  212.63 119.11 30.94 898.55 867.61 1.69
##    kurtosis   se
## X1     2.74 8.58
## ------------------------------------------------------------ 
## group: small_business
##    vars   n   mean     sd median trimmed    mad   min    max  range skew
## X1    1 619 433.83 248.59 394.36   422.7 276.49 16.25 926.83 910.58 0.39
##    kurtosis   se
## X1    -0.99 9.99
## ------------------------------------------------------------ 
## group: all_other
##    vars    n   mean     sd median trimmed    mad   min    max  range skew
## X1    1 2331 244.94 184.27 190.63  215.95 139.17 15.69 916.95 901.26 1.44
##    kurtosis   se
## X1     1.81 3.82
## ------------------------------------------------------------ 
## group: home_improvement
##    vars   n   mean     sd median trimmed    mad   min    max  range skew
## X1    1 629 337.07 222.11  282.4  313.57 195.67 28.47 902.06 873.59 0.83
##    kurtosis   se
## X1    -0.24 8.86

here we can see 7 different categories grouped by categorical variable Purpose which was turned into factor earlier

The minimum and maximum values do not differ very much between groups.

The largest max value is for the purpose of debt consolidation (940.14 USD), while the smallest max value is for the purpose of education (861.88 USD).

The smallest minimum value among the categories is observed for the categories education and all other reasons and is about 15.7 dollars, while the largest minimum value is for major purchases and is 30.9 dollars.

We can conclude the following regarding the mean: the highest average installments are for small business (433.83), debt consolidation (358.98) and home improvement (337.07), while the smallest is for education (217.55).

Now we need to check homoscedasticity, i.e. whether the variances are equal. With this test we will find out whether we need a Welch correction.

We can get a preliminary result regarding homoscedasticity by comparing the standard deviations from the descriptive statistics above (the sd column). They do not have to be absolutely identical, but they should be roughly the same.

But as we can see, these values are quite different. For example, for the education group sd is 168.51 and for the small business group it is 248.59, which already differs by about a third. Smaller values of sd are also characteristic of groups such as major purchases (179) and all other reasons (184), while home improvement has a larger sd (222).

So we can make a preliminary assumption that homoscedasticity will be violated

but for a more accurate result, let’s conduct a Levene test

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:psych':
## 
##     logit

leveneTest(data$installment, group = data$purposeF)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    6  38.588 < 2.2e-16 ***
##       9571                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: sigma ^2 (1) = sigma ^2 (2) = ... = sigma ^2 (7)

H1: at least one differs

We reject H0 at p < 0.001

as homoscedasticity is violated, we need Welch correction
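For reference, a Welch-corrected one-way ANOVA is available in base R through oneway.test(). Below is a minimal sketch on simulated data. The data frame `toy` with columns `y` and `g` is hypothetical and only illustrates the call, it is not the analysis used later in this report.

```r
set.seed(4)
# three hypothetical groups with clearly different variances
toy <- data.frame(
  y = c(rnorm(50, 10, 1), rnorm(50, 11, 3), rnorm(50, 12, 6)),
  g = factor(rep(c("a", "b", "c"), each = 50))
)

classic     <- oneway.test(y ~ g, data = toy, var.equal = TRUE)    # classic ANOVA
welch_anova <- oneway.test(y ~ g, data = toy, var.equal = FALSE)   # Welch correction

classic$parameter       # num df and denom df (n - k)
welch_anova$parameter   # denom df is adjusted downwards
```

The Welch version only deals with unequal variances. It does not fix non-normality, which is why normality still has to be checked.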

There are two ways to check for normality: graphically and with a formal test

Let’s start with a boxplot

ggplot(data, aes(x=installment,   ###add variable
                 fill=purposeF))+   ###group by categories
  geom_boxplot() +   ###type of the graph
  xlab("Installment") +   ###x-axis name
  labs(fill="Purpose of Credit",   ###add legend 
       title = "Installment based on Purpose")   ###add title

From the boxplot it looks like none of the groups is normally distributed.

We can observe outliers, which are marked with bold dots on the graph, for groups such as all other reasons, major purchases, education, debt consolidation and credit card.

Also, almost all groups are right (positively) skewed, since the median as well as the first and third quartiles are shifted to the left. Even removing the outliers does not seem to correct the situation. Only the small business group looks more or less normally distributed.

in order to make sure that normality is violated, let’s conduct the Shapiro-Wilk test

library(dplyr)
library(rstatix)

data %>%
  group_by(purposeF) %>%
  shapiro_test(installment)

## # A tibble: 7 × 4
##   purposeF           variable    statistic        p
##   <fct>              <chr>           <dbl>    <dbl>
## 1 credit_card        installment     0.922 4.65e-25
## 2 debt_consolidation installment     0.951 1.77e-34
## 3 educational        installment     0.849 1.09e-17
## 4 major_purchase     installment     0.827 2.17e-21
## 5 small_business     installment     0.943 9.91e-15
## 6 all_other          installment     0.862 3.47e-41
## 7 home_improvement   installment     0.916 3.33e-18

H0: variable is normally distributed

H1: is not

for each of these groups, we reject h0 at p-value < 0.001, therefore, the data in none of the groups are normally distributed

As normality is violated we have to go for the Kruskal-Wallis Rank Sum Test

KRUSKAL-WALLIS RANK SUM TEST

Comparison of three or more distribution locations of variables for independent samples (extension of the Wilcoxon Rank Sum Test).

In this test we are going to check whether there is any difference in the distribution locations of the variable installment between the groups of the categorical variable purpose

library(rstatix)

data %>%
  group_by(purposeF) %>%
  get_summary_stats(installment, type = "median_iqr")

## # A tibble: 7 × 5
##   purposeF           variable        n median   iqr
##   <fct>              <fct>       <dbl>  <dbl> <dbl>
## 1 credit_card        installment  1262   267.  255.
## 2 debt_consolidation installment  3957   325.  290.
## 3 educational        installment   343   170.  179.
## 4 major_purchase     installment   437   199.  172.
## 5 small_business     installment   619   394.  402.
## 6 all_other          installment  2331   191.  208.
## 7 home_improvement   installment   629   282.  328.

in this test we are particularly interested in comparing medians

As we can see from the statistics above, for the educational purpose the median is equal to 169.62, while the median for small business is equal to 394.36. That is a huge difference: more than twice as large.

In general we can see very low medians for educational purpose or major purchases and very large medians for debt consolidation, small business or home improvement.

Now let’s run the Kruskal-Wallis Rank Sum test

kruskal.test(installment ~ purposeF, 
             data = data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  installment by purposeF
## Kruskal-Wallis chi-squared = 940.39, df = 6, p-value < 2.2e-16

let’s set our hypothesis:

H0: All distribution locations of variables are the same

H1: At least one distribution location of variable is different

We reject H0 at p<0.001

At least one distribution location differs from the others.

Now let’s check exactly which groups differ

library(rstatix)

groups_nonpar <- wilcox_test(installment ~ purposeF,
                             paired = FALSE,     ###means the samples are independent, not paired
                             p.adjust.method = "bonferroni",     ###when there are more than 2 categories, it is better to adjust the p-values
                             data = data)

groups_nonpar

## # A tibble: 21 × 9
##    .y.      group1 group2    n1    n2 statistic         p     p.adj p.adj.signif
##  * <chr>    <chr>  <chr>  <int> <int>     <dbl>     <dbl>     <dbl> <chr>       
##  1 install… credi… debt_…  1262  3957  2164308  9.66e- 13 2.03e- 11 ****        
##  2 install… credi… educa…  1262   343   291454  6.43e- 23 1.35e- 21 ****        
##  3 install… credi… major…  1262   437   347766  3.71e- 16 7.79e- 15 ****        
##  4 install… credi… small…  1262   619   285308  1.88e- 21 3.95e- 20 ****        
##  5 install… credi… all_o…  1262  2331  1848241  4.95e- 37 1.04e- 35 ****        
##  6 install… credi… home_…  1262   629   388231  4.38e-  1 1   e+  0 ns          
##  7 install… debt_… educa…  3957   343   991976  8.28e- 46 1.74e- 44 ****        
##  8 install… debt_… major…  3957   437  1203204. 2.89e- 41 6.07e- 40 ****        
##  9 install… debt_… small…  3957   619  1026402. 8.73e- 11 1.83e-  9 ****        
## 10 install… debt_… all_o…  3957  2331  6355142. 9.79e-139 2.06e-137 ****        
## # ℹ 11 more rows

The hypotheses are similar to those of the non-parametric version of the independent test, the Wilcoxon Rank Sum test

H0: locations of distributions of variable installment are the same for both groups of categorical variable purpose

H1: are not the same

Remark: here we are comparing locations of distributions for each pair of categories separately
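Because every pair is tested separately, the p-values are adjusted with the Bonferroni method: each raw p-value is multiplied by the number of comparisons and capped at 1. A minimal sketch using two raw p-values copied from the table above (rows 1 and 6) shows how the p.adj column arises. With 7 categories there are choose(7, 2) = 21 pairs.

```r
m <- choose(7, 2)                # number of pairwise comparisons among 7 groups

p_raw <- c(9.66e-13, 4.38e-1)    # raw p-values from rows 1 and 6 of the table

# Bonferroni: multiply by the number of comparisons, cap at 1
p_adj <- p.adjust(p_raw, method = "bonferroni", n = m)
p_adj
```

The first value becomes about 2.03e-11 and the second is capped at 1, as shown in the p.adj column.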

as we can see , p-value is less than Alpha (0.05) in most cases. that is, in most cases we reject H0 and accept H1. Only in certain exceptions we cannot reject h0

Let me list the pairs where we cannot reject H0 (p-value > 0.05):

credit_card - home_improvement

educational - major_purchase

educational - all_other

major_purchase - all_other

In these four pairs of categories, we cannot reject that the locations of the distributions of the installment are the same for both categories.

We reject H0 at p < 0.001 in all other cases.

This means that in all other cases the locations of the distributions of the installment are not the same for the two categories in the pair.

Now let’s calculate what effect the purpose of borrowing has on the size of the installment. For this, we will compute the following effect size

kruskal_effsize(installment ~ purposeF, 
                data = data)

## # A tibble: 1 × 5
##   .y.             n effsize method  magnitude
## * <chr>       <int>   <dbl> <chr>   <ord>    
## 1 installment  9578  0.0976 eta2[H] moderate

As we can see, the effect size is 0.0976, which is classified as a moderate effect size
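The value can also be reproduced by hand from the Kruskal-Wallis output. This is a sketch using the formula eta2[H] = (H - k + 1) / (n - k), where H is the test statistic, k the number of groups and n the number of observations. All three numbers are taken from the outputs above.

```r
H <- 940.39   # Kruskal-Wallis chi-squared from the test output
k <- 7        # number of purpose categories
n <- 9578     # total number of borrowers

eta2_H <- (H - k + 1) / (n - k)
round(eta2_H, 4)
```

The result rounds to 0.0976, the same value that kruskal_effsize() reports.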

Conclusion: As a result of the conducted test, we learned that there is a difference (H0 rejected at p < 0.001) between the distribution locations of the installment in different groups. This difference exists in all pairs of categories except the following: credit_card - home_improvement, educational - major_purchase, educational - all_other, major_purchase - all_other. Purpose has a moderate effect on the installment amount (eta2[H] = 0.0976).


2024-04-16