hw1

Author: Polina Kamenchuk. Student ID: 12300685

##to upload data set into R.

data <- read.table("./loan_data.csv", ###name of the file 
                     header = TRUE,  ###first row as a header 
                     sep = ",",  ###columns are separated from each other with ","
                     dec = ".")    ###"." is used in decimal numbers

str(data)   ###to see the data structure

## 'data.frame':    9578 obs. of  14 variables:
##  $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ purpose          : chr  "debt_consolidation" "credit_card" "debt_consolidation" "debt_consolidation" ...
##  $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
##  $ installment      : num  829 228 367 162 103 ...
##  $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
##  $ dti              : num  19.5 14.3 11.6 8.1 15 ...
##  $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
##  $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
##  $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
##  $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
##  $ inq.last.6mths   : int  0 0 1 1 0 0 0 0 1 1 ...
##  $ delinq.2yrs      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pub.rec          : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ not.fully.paid   : int  0 0 0 0 0 0 1 1 0 0 ...

head(data)   ###to display the first 6 rows

##   credit.policy            purpose int.rate installment log.annual.inc   dti
## 1             1 debt_consolidation   0.1189      829.10       11.35041 19.48
## 2             1        credit_card   0.1071      228.22       11.08214 14.29
## 3             1 debt_consolidation   0.1357      366.86       10.37349 11.63
## 4             1 debt_consolidation   0.1008      162.34       11.35041  8.10
## 5             1        credit_card   0.1426      102.92       11.29973 14.97
## 6             1        credit_card   0.0788      125.13       11.90497 16.98
##   fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs
## 1  737          5639.958     28854       52.1              0           0
## 2  707          2760.000     33623       76.7              0           0
## 3  682          4710.000      3511       25.6              1           0
## 4  712          2699.958     33667       73.2              1           0
## 5  667          4066.000      4740       39.5              0           1
## 6  727          6120.042     50807       51.0              0           0
##   pub.rec not.fully.paid
## 1       0              0
## 2       0              0
## 3       0              0
## 4       0              0
## 5       0              0
## 6       0              0

DATA DESCRIPTION

The data set was taken from the Kaggle website (https://www.kaggle.com/datasets/itssuru/loan-data).

The sample contains users (borrowers) of LendingClub.com, a platform where private borrowers and lenders meet. The sample size is 9578 observations (this can be seen from str(data) above or from data in the Environment pane).

The unit of observation is one such borrower.

Description of each variable:

- credit.policy: 1 if the customer meets the credit underwriting criteria, and 0 if not, categorical.
- purpose: The purpose of the loan, categorical, nominal.
- int.rate: The interest rate of the loan, as a proportion (10% is stored as 0.10), numerical.
- installment: The monthly installments owed by the borrower if the loan is funded, in $, numerical.
- log.annual.inc: The natural log of the self-reported annual income of the borrower, numerical.
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income), numerical. As a ratio it has no $ unit.
- fico: The FICO credit score of the borrower, in score points, numerical.
- days.with.cr.line: The number of days the borrower has had a credit line, numerical.
- revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle), in $, numerical.
- revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available), numerical.
- inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months, a count, numerical.
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years, numerical.
- pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments), numerical.
- not.fully.paid: 1 if the loan was not fully paid, and 0 if it was paid (see the factor labels used below), categorical.

### I am not really familiar with the FICO score, and I assume there are enough other variables to analyse, so let me remove it from the data set :)

data <- data[, -7 ]

head(data)

##   credit.policy            purpose int.rate installment log.annual.inc   dti
## 1             1 debt_consolidation   0.1189      829.10       11.35041 19.48
## 2             1        credit_card   0.1071      228.22       11.08214 14.29
## 3             1 debt_consolidation   0.1357      366.86       10.37349 11.63
## 4             1 debt_consolidation   0.1008      162.34       11.35041  8.10
## 5             1        credit_card   0.1426      102.92       11.29973 14.97
## 6             1        credit_card   0.0788      125.13       11.90497 16.98
##   days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec
## 1          5639.958     28854       52.1              0           0       0
## 2          2760.000     33623       76.7              0           0       0
## 3          4710.000      3511       25.6              1           0       0
## 4          2699.958     33667       73.2              1           0       0
## 5          4066.000      4740       39.5              0           1       0
## 6          6120.042     50807       51.0              0           0       0
##   not.fully.paid
## 1              0
## 2              0
## 3              0
## 4              0
## 5              0
## 6              0
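The positional index -7 used above to drop fico only works as long as the column order never changes. A minimal sketch of the same step done by column name, on a toy data frame (the object names toy and toy_no_fico are made up for illustration; the values are the first two rows shown in head(data)):

```r
# Toy data frame with three of the original columns (first two rows of the data).
toy <- data.frame(dti = c(19.48, 14.29),
                  fico = c(737, 707),
                  pub.rec = c(0, 0))

# Keep every column whose name is not "fico".
toy_no_fico <- toy[, names(toy) != "fico"]

names(toy_no_fico)
# [1] "dti"     "pub.rec"
```

Selecting by name makes the intent visible in the code and does not break if a column is added or removed earlier in the script.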

calculate_number_of_variables <- function(data) {
  num_variables <- ncol(data)
  cat("Number of variables in the dataset:", num_variables, "\n")
}

calculate_number_of_variables(data)

## Number of variables in the dataset: 13

There are also 13 variables. This can be seen in Environment > data, and it can be calculated with the function above.

data$credit.policy.F <- factor(data$credit.policy, 
                               levels = c(1, 0), 
                               labels = c("Fulfill", "Oterwise"))  ###to convert categorical into factor


data$purposeF <- factor(data$purpose, 
                               levels = c("credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", "all_other", "home_improvement"), 
                               labels = c("credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", "all_other", "home_improvement"))    ###to convert categorical into factor


data$not.fully.paidF <- factor(data$not.fully.paid, 
                               levels = c(1, 0), 
                               labels = c("Not Paid", "Paid"))     ###to convert categorical into factor

data$revol.balN <- as.numeric(data$revol.bal)     ###to convert integer to numeric (double), as I would say the amount unpaid is in dollars so it should be numerical


data <- data[ ,  c(-1, -2, -8, -13)   ]  ### I also delete the original versions of the transformed variables (credit.policy, purpose, revol.bal, not.fully.paid), because they are no longer needed now that we have the new, correct ones

head(data, 15)   ### display first 15 rows

##    int.rate installment log.annual.inc   dti days.with.cr.line revol.util
## 1    0.1189      829.10      11.350407 19.48          5639.958       52.1
## 2    0.1071      228.22      11.082143 14.29          2760.000       76.7
## 3    0.1357      366.86      10.373491 11.63          4710.000       25.6
## 4    0.1008      162.34      11.350407  8.10          2699.958       73.2
## 5    0.1426      102.92      11.299732 14.97          4066.000       39.5
## 6    0.0788      125.13      11.904968 16.98          6120.042       51.0
## 7    0.1496      194.02      10.714418  4.00          3180.042       76.8
## 8    0.1114      131.22      11.002100 11.08          5116.000       68.6
## 9    0.1134       87.19      11.407565 17.25          3989.000       51.1
## 10   0.1221       84.12      10.203592 10.00          2730.042       23.0
## 11   0.1347      360.43      10.434116 22.09          6713.042       71.0
## 12   0.1324      253.58      11.835009  9.16          4298.000       18.2
## 13   0.0859      316.11      10.933107 15.49          6519.958       16.7
## 14   0.0714       92.82      11.512925  6.50          4384.000        4.8
## 15   0.0863      209.54       9.487972  9.73          1559.958       44.6
##    inq.last.6mths delinq.2yrs pub.rec credit.policy.F           purposeF
## 1               0           0       0         Fulfill debt_consolidation
## 2               0           0       0         Fulfill        credit_card
## 3               1           0       0         Fulfill debt_consolidation
## 4               1           0       0         Fulfill debt_consolidation
## 5               0           1       0         Fulfill        credit_card
## 6               0           0       0         Fulfill        credit_card
## 7               0           0       1         Fulfill debt_consolidation
## 8               0           0       0         Fulfill          all_other
## 9               1           0       0         Fulfill   home_improvement
## 10              1           0       0         Fulfill debt_consolidation
## 11              2           0       1         Fulfill debt_consolidation
## 12              2           1       0         Fulfill debt_consolidation
## 13              0           0       0         Fulfill debt_consolidation
## 14              0           1       0         Fulfill     small_business
## 15              0           0       0         Fulfill debt_consolidation
##    not.fully.paidF revol.balN
## 1             Paid      28854
## 2             Paid      33623
## 3             Paid       3511
## 4             Paid      33667
## 5             Paid       4740
## 6             Paid      50807
## 7         Not Paid       3839
## 8         Not Paid      24220
## 9             Paid      69909
## 10            Paid       5630
## 11            Paid      13846
## 12            Paid       5122
## 13            Paid       6068
## 14            Paid       3021
## 15            Paid       6282

str(data)   ###structure

## 'data.frame':    9578 obs. of  13 variables:
##  $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
##  $ installment      : num  829 228 367 162 103 ...
##  $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
##  $ dti              : num  19.5 14.3 11.6 8.1 15 ...
##  $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
##  $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
##  $ inq.last.6mths   : int  0 0 1 1 0 0 0 0 1 1 ...
##  $ delinq.2yrs      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pub.rec          : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ credit.policy.F  : Factor w/ 2 levels "Fulfill","Oterwise": 1 1 1 1 1 1 1 1 1 1 ...
##  $ purposeF         : Factor w/ 7 levels "credit_card",..: 2 1 2 2 1 1 2 6 7 2 ...
##  $ not.fully.paidF  : Factor w/ 2 levels "Not Paid","Paid": 2 2 2 2 2 2 1 1 2 2 ...
##  $ revol.balN       : num  28854 33623 3511 33667 4740 ...
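To make the factor conversion above easier to follow, here is a small sketch of how levels and labels interact. The vector paid_flag is made-up toy data, not taken from the file; the levels and labels are the ones used for not.fully.paidF.

```r
# Toy 0/1 indicator (hypothetical values).
paid_flag <- c(0, 0, 1, 1, 0)

# levels gives the order of the original codes, labels the names they receive:
# code 1 becomes "Not Paid" (level 1), code 0 becomes "Paid" (level 2).
paid_factor <- factor(paid_flag,
                      levels = c(1, 0),
                      labels = c("Not Paid", "Paid"))

levels(paid_factor)
# [1] "Not Paid" "Paid"

as.integer(paid_factor)   # internal level numbers, as printed by str()
# [1] 2 2 1 1 2
```

This is why str(data) shows 2 for paid loans and 1 for unpaid ones: the numbers are positions in the levels vector, not the original 0/1 codes.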

###to round data to 2 decimal places (just to make it a bit nicer looking). note that this overwrites the stored values, so for example int.rate 0.1189 becomes 0.12

data$log.annual.inc <- round(data$log.annual.inc, 2)
data$int.rate <- round(data$int.rate, 2)
data$installment <- round(data$installment, 2)
data$dti <- round(data$dti, 2)
data$days.with.cr.line <- round(data$days.with.cr.line, 2)

head(data)

##   int.rate installment log.annual.inc   dti days.with.cr.line revol.util
## 1     0.12      829.10          11.35 19.48           5639.96       52.1
## 2     0.11      228.22          11.08 14.29           2760.00       76.7
## 3     0.14      366.86          10.37 11.63           4710.00       25.6
## 4     0.10      162.34          11.35  8.10           2699.96       73.2
## 5     0.14      102.92          11.30 14.97           4066.00       39.5
## 6     0.08      125.13          11.90 16.98           6120.04       51.0
##   inq.last.6mths delinq.2yrs pub.rec credit.policy.F           purposeF
## 1              0           0       0         Fulfill debt_consolidation
## 2              0           0       0         Fulfill        credit_card
## 3              1           0       0         Fulfill debt_consolidation
## 4              1           0       0         Fulfill debt_consolidation
## 5              0           1       0         Fulfill        credit_card
## 6              0           0       0         Fulfill        credit_card
##   not.fully.paidF revol.balN
## 1            Paid      28854
## 2            Paid      33623
## 3            Paid       3511
## 4            Paid      33667
## 5            Paid       4740
## 6            Paid      50807
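Because round() above overwrites the columns, all later statistics are computed on the rounded values. A minimal sketch of the alternative, rounding only what is printed, using the first four int.rate values from the original head(data) (the names rates and rounded are made up for illustration):

```r
# First four interest rates as they appear in the raw data.
rates <- c(0.1189, 0.1071, 0.1357, 0.1008)

# Round a copy for display; the original vector stays untouched.
rounded <- round(rates, 2)
rounded
# [1] 0.12 0.11 0.14 0.10

# Largest change introduced by rounding (0.1357 -> 0.14).
max(abs(rates - rounded))
```

For a variable measured on a small scale, such as an interest rate stored as a proportion, two decimals remove a visible part of the information, so rounding at print time is the safer choice.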

###to find missing values. if there are any, we can remove them, e.g. with the function "drop_na" from the tidyr package

find_missing_values <- function(data) {
  missing <- sum(is.na(data))
  if(missing > 0) {
    cat("Number of missing values in the dataset:", missing, "\n")
    cat("Indices of missing values:\n")
    print(which(is.na(data), arr.ind = TRUE))
  } else {
    cat("No missing values found in the dataset.\n")
  }
}


find_missing_values(data)

## No missing values found in the dataset.

###however, as we can see, there are no missing values here
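Since the loan data has no missing values, the check above never shows its other branch. A small sketch on a toy data frame with two deliberately inserted NA values (the data frame toy is hypothetical) shows a per-column count and the base R way of dropping incomplete rows:

```r
# Toy data with two missing values inserted on purpose.
toy <- data.frame(dti = c(19.48, NA, 11.63),
                  pub.rec = c(0, 0, NA),
                  fico = c(737, 707, 682))

# Number of missing values in each column.
colSums(is.na(toy))
#     dti pub.rec    fico 
#       1       1       0 

# Keep only the rows without any missing value.
complete <- na.omit(toy)
nrow(complete)
# [1] 1
```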

summary(data)  ###descriptive statistic for all variables

##     int.rate       installment     log.annual.inc       dti        
##  Min.   :0.0600   Min.   : 15.67   Min.   : 7.55   Min.   : 0.000  
##  1st Qu.:0.1000   1st Qu.:163.77   1st Qu.:10.56   1st Qu.: 7.213  
##  Median :0.1200   Median :268.95   Median :10.93   Median :12.665  
##  Mean   :0.1228   Mean   :319.09   Mean   :10.93   Mean   :12.607  
##  3rd Qu.:0.1400   3rd Qu.:432.76   3rd Qu.:11.29   3rd Qu.:17.950  
##  Max.   :0.2200   Max.   :940.14   Max.   :14.53   Max.   :29.960  
##                                                                    
##  days.with.cr.line   revol.util    inq.last.6mths    delinq.2yrs     
##  Min.   :  179     Min.   :  0.0   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 2820     1st Qu.: 22.6   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median : 4140     Median : 46.3   Median : 1.000   Median : 0.0000  
##  Mean   : 4561     Mean   : 46.8   Mean   : 1.577   Mean   : 0.1637  
##  3rd Qu.: 5730     3rd Qu.: 70.9   3rd Qu.: 2.000   3rd Qu.: 0.0000  
##  Max.   :17640     Max.   :119.0   Max.   :33.000   Max.   :13.0000  
##                                                                      
##     pub.rec        credit.policy.F               purposeF    not.fully.paidF
##  Min.   :0.00000   Fulfill :7710   credit_card       :1262   Not Paid:1533  
##  1st Qu.:0.00000   Oterwise:1868   debt_consolidation:3957   Paid    :8045  
##  Median :0.00000                   educational       : 343                  
##  Mean   :0.06212                   major_purchase    : 437                  
##  3rd Qu.:0.00000                   small_business    : 619                  
##  Max.   :5.00000                   all_other         :2331                  
##                                    home_improvement  : 629                  
##    revol.balN     
##  Min.   :      0  
##  1st Qu.:   3187  
##  Median :   8596  
##  Mean   :  16914  
##  3rd Qu.:  18250  
##  Max.   :1207359  
##

summary(data [   , c(-10, -11, -12) ]) ###descriptive statistics without the categorical variables

##     int.rate       installment     log.annual.inc       dti        
##  Min.   :0.0600   Min.   : 15.67   Min.   : 7.55   Min.   : 0.000  
##  1st Qu.:0.1000   1st Qu.:163.77   1st Qu.:10.56   1st Qu.: 7.213  
##  Median :0.1200   Median :268.95   Median :10.93   Median :12.665  
##  Mean   :0.1228   Mean   :319.09   Mean   :10.93   Mean   :12.607  
##  3rd Qu.:0.1400   3rd Qu.:432.76   3rd Qu.:11.29   3rd Qu.:17.950  
##  Max.   :0.2200   Max.   :940.14   Max.   :14.53   Max.   :29.960  
##  days.with.cr.line   revol.util    inq.last.6mths    delinq.2yrs     
##  Min.   :  179     Min.   :  0.0   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 2820     1st Qu.: 22.6   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median : 4140     Median : 46.3   Median : 1.000   Median : 0.0000  
##  Mean   : 4561     Mean   : 46.8   Mean   : 1.577   Mean   : 0.1637  
##  3rd Qu.: 5730     3rd Qu.: 70.9   3rd Qu.: 2.000   3rd Qu.: 0.0000  
##  Max.   :17640     Max.   :119.0   Max.   :33.000   Max.   :13.0000  
##     pub.rec          revol.balN     
##  Min.   :0.00000   Min.   :      0  
##  1st Qu.:0.00000   1st Qu.:   3187  
##  Median :0.00000   Median :   8596  
##  Mean   :0.06212   Mean   :  16914  
##  3rd Qu.:0.00000   3rd Qu.:  18250  
##  Max.   :5.00000   Max.   :1207359

Explaining every variable would only repeat the same interpretation, so I'll just talk about two variables of my choice.

The lowest interest rate in this sample was 6% (Min. 0.06, after rounding to two decimals), the highest was 22%. 1st Qu. means that a quarter of borrowers had interest rates of 10% or lower, the median means that half of borrowers had interest rates of 12% or lower, and 3rd Qu. means that 75% of borrowers had interest rates of 14% or lower; respectively, the last quarter of borrowers had interest rates above 14%. "Mean" means that the arithmetic average of the interest rate is 12.28%.

days.with.cr.line is the number of days the borrower has had a credit line (not the term of the loan). The shortest credit history is 179 days, the longest is 17,640 days. The mean is the sum of all values divided by the number of observations; here it means that, on average, borrowers have had a credit line for 4561 days. 1st Qu. means that a quarter of borrowers have had a credit line for 2820 days or less. The median means that half of the borrowers have had a credit line for 4,140 days or less. 3rd Qu. means that 75% of borrowers have had a credit line for 5730 days or less.
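The reading of the quartiles used above ("a quarter / half / 75% of observations are at or below this value") can be checked directly. A minimal sketch on eight made-up values (days, q and shares are illustrative names, not from the data set):

```r
# Eight hypothetical credit-history lengths in days.
days <- c(200, 900, 1500, 2800, 4100, 5700, 9000, 17000)

# First quartile, median and third quartile (R's default method).
q <- quantile(days, probs = c(0.25, 0.50, 0.75))
q
#  25%  50%  75% 
# 1350 3450 6525 

# Share of observations at or below each of the three values.
shares <- sapply(q, function(threshold) mean(days <= threshold))
shares
#  25%  50%  75% 
# 0.25 0.50 0.75 
```

With other sample sizes or with tied values the shares are only approximately 25%, 50% and 75%, which is why the interpretation is usually worded as "about a quarter".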

###another way to calculate the desired values from descriptive statistics is to use the individual functions below. I will calculate them only for the borrower's revolving balance variable.

min(data$revol.balN)

## [1] 0

max(data$revol.balN)

## [1] 1207359

mean(data$revol.balN)

## [1] 16913.96

median(data$revol.balN)

## [1] 8596

sd(data$revol.balN)

## [1] 33756.19

var(data$revol.balN)

## [1] 1139480333

range(data$revol.balN)

## [1]       0 1207359

sum(data$revol.balN)

## [1] 162001946

quantile(data$revol.balN)

##        0%       25%       50%       75%      100% 
##       0.0    3187.0    8596.0   18249.5 1207359.0

quantile(data$revol.balN, probs = 0.10)

##   10% 
## 710.7

- min: the smallest unpaid amount is $0 (i.e. everything is paid).
- max: the largest unpaid amount is $1,207,359.
- mean: the average outstanding amount of a borrower in the sample is $16,913.96.
- median: half of the borrowers had an unpaid amount of $8,596 or less at the end of the credit card billing cycle.
- sd: the standard deviation is $33,756.19, which is roughly the typical distance of a value from the mean.
- var: the variance, 1,139,480,333, is the square of the standard deviation (so it is in squared dollars).
- range: range() returns the smallest and the largest value, $0 and $1,207,359. The difference between them, i.e. the spread of the unpaid amounts across borrowers, is $1,207,359.
- sum: the total amount of all unpaid funds from all borrowers is $162,001,946.
- quantile: a quarter of borrowers had an unpaid amount of $3,187 or less, half of the borrowers $8,596 or less, and 75% of borrowers $18,249.5 or less.
- quantile, 10%: 10% of borrowers had an unpaid amount of $710.7 or less.
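Two of the points above are easy to mix up: range() returns two numbers rather than their difference, and var() is simply sd() squared. A short sketch on five revolving balances taken from the first rows of the data (the vector name balances is made up):

```r
# Five revolving balances from the first rows of the data set.
balances <- c(0, 3511, 4740, 28854, 33623)

range(balances)          # smallest and largest value
# [1]     0 33623

diff(range(balances))    # the spread: largest minus smallest
# [1] 33623

# The variance is the squared standard deviation.
all.equal(sd(balances)^2, var(balances))
# [1] TRUE
```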

#install.packages("pastecs")

library(pastecs)

round(stat.desc(data[   , c(-10, -11, -12) ]), 2)

##              int.rate installment log.annual.inc       dti days.with.cr.line
## nbr.val       9578.00     9578.00        9578.00   9578.00           9578.00
## nbr.null         0.00        0.00           0.00     89.00              0.00
## nbr.na           0.00        0.00           0.00      0.00              0.00
## min              0.06       15.67           7.55      0.00            178.96
## max              0.22      940.14          14.53     29.96          17639.96
## range            0.16      924.47           6.98     29.96          17461.00
## sum           1175.73  3056238.40      104710.00 120746.77       43683027.92
## median           0.12      268.95          10.93     12.66           4139.96
## mean             0.12      319.09          10.93     12.61           4560.77
## SE.mean          0.00        2.12           0.01      0.07             25.51
## CI.mean.0.95     0.00        4.15           0.01      0.14             50.01
## var              0.00    42878.52           0.38     47.39        6234661.48
## std.dev          0.03      207.07           0.61      6.88           2496.93
## coef.var         0.22        0.65           0.06      0.55              0.55
##              revol.util inq.last.6mths delinq.2yrs pub.rec   revol.balN
## nbr.val         9578.00        9578.00     9578.00 9578.00 9.578000e+03
## nbr.null         297.00        3637.00     8458.00 9019.00 3.210000e+02
## nbr.na             0.00           0.00        0.00    0.00 0.000000e+00
## min                0.00           0.00        0.00    0.00 0.000000e+00
## max              119.00          33.00       13.00    5.00 1.207359e+06
## range            119.00          33.00       13.00    5.00 1.207359e+06
## sum           448243.08       15109.00     1568.00  595.00 1.620019e+08
## median            46.30           1.00        0.00    0.00 8.596000e+03
## mean              46.80           1.58        0.16    0.06 1.691396e+04
## SE.mean            0.30           0.02        0.01    0.00 3.449200e+02
## CI.mean.0.95       0.58           0.04        0.01    0.01 6.761100e+02
## var              841.84           4.84        0.30    0.07 1.139480e+09
## std.dev           29.01           2.20        0.55    0.26 3.375619e+04
## coef.var           0.62           1.39        3.34    4.22 2.000000e+00

###The stat.desc function also calculates basic statistics. round is needed to round the value to two decimal places so that the results look better. I also removed categorical variables, because their statistical description does not make any sense

Here, too, I will explain only one variable, installment:

- nbr.val 9578.00 is the number of observations that have a value, that is, which are not missing. Zeros are counted as values (compare dti, which has 89 zeros and still nbr.val = 9578). In this particular case, it is all 9578 observations.
- nbr.null 0.00 and nbr.na 0.00: this means that we have neither zeros nor missing data in the installment data.
- min 15.67: the minimum monthly payment is $15.67.
- max 940.14: the borrower's maximum monthly payment is $940.14.
- range 924.47: the difference between the largest and smallest monthly payments in the sample is $924.47.
- sum 3056238.40: all members of the sample taken together must pay $3,056,238.40 per month.
- median 268.95: half of the borrowers pay $268.95 or less each month.
- mean 319.09: on average, a borrower pays $319.09 per month.
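The three rows not explained above (SE.mean, CI.mean.0.95 and coef.var) can be reproduced from n, the mean and the standard deviation. The sketch below assumes the usual textbook formulas (standard error = sd / sqrt(n), half-width of the 95% confidence interval from the t distribution, coefficient of variation = sd / mean) and plugs in the installment values printed in the table; the results agree with the stat.desc output.

```r
# Values for installment, copied from the stat.desc table above.
n <- 9578
s <- 207.07    # std.dev
m <- 319.09    # mean

se <- s / sqrt(n)                   # standard error of the mean
ci <- qt(0.975, df = n - 1) * se    # half-width of the 95% confidence interval
cv <- s / m                         # coefficient of variation

round(c(SE.mean = se, CI.mean.0.95 = ci, coef.var = cv), 2)
#      SE.mean CI.mean.0.95     coef.var 
#         2.12         4.15         0.65 
```

So the mean monthly installment of $319.09 is estimated with a margin of about $4.15 in either direction at the 95% level.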

library(psych)

describe(data [   , c(-10, -11, -12) ])   ###another function to calculate statistics, without categorical variables as well

##                   vars    n     mean       sd  median  trimmed     mad    min
## int.rate             1 9578     0.12     0.03    0.12     0.12    0.03   0.06
## installment          2 9578   319.09   207.07  268.95   295.64  184.88  15.67
## log.annual.inc       3 9578    10.93     0.61   10.93    10.93    0.55   7.55
## dti                  4 9578    12.61     6.88   12.66    12.59    7.98   0.00
## days.with.cr.line    5 9578  4560.77  2496.93 4139.96  4303.64 2135.06 178.96
## revol.util           6 9578    46.80    29.01   46.30    46.50   35.88   0.00
## inq.last.6mths       7 9578     1.58     2.20    1.00     1.16    1.48   0.00
## delinq.2yrs          8 9578     0.16     0.55    0.00     0.02    0.00   0.00
## pub.rec              9 9578     0.06     0.26    0.00     0.00    0.00   0.00
## revol.balN          10 9578 16913.96 33756.19 8596.00 10809.19 9619.11   0.00
##                          max      range  skew kurtosis     se
## int.rate                0.22       0.16  0.17    -0.21   0.00
## installment           940.14     924.47  0.91     0.14   2.12
## log.annual.inc         14.53       6.98  0.03     1.61   0.01
## dti                    29.96      29.96  0.02    -0.90   0.07
## days.with.cr.line   17639.96   17461.00  1.16     1.94  25.51
## revol.util            119.00     119.00  0.06    -1.12   0.30
## inq.last.6mths         33.00      33.00  3.58    26.27   0.02
## delinq.2yrs            13.00      13.00  6.06    71.38   0.01
## pub.rec                 5.00       5.00  5.12    38.75   0.00
## revol.balN        1207359.00 1207359.00 11.16   259.46 344.92

Here I propose to analyze the pub.rec variable, i.e. the borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments):

- vars: this is variable number 9 in the output table.
- n: the number of observations, which is 9578 for every variable.
- mean: as always, the arithmetic mean; on average, a borrower has 0.06 public records.
- sd: the standard deviation, 0.26 records, which is roughly the typical distance of a value from the mean.
- median: 0.00 means that at least half of the borrowers have 0 public records.
- trimmed: the trimmed mean, i.e. the mean calculated after a certain proportion of the data points has been removed from both ends of the distribution (describe() removes 10% from each end by default). This is done to reduce the influence of extreme values (outliers). Here the trimmed mean is 0.00 while the ordinary mean is 0.06, so the mean is pulled up by a small number of borrowers with several records.
- min and max: the smallest and largest number of public records a borrower has, in this case 0 and 5, respectively.
- range: the spread of the data, max - min = 5 - 0 = 5 records.
- skew: skewness means asymmetry of the distribution to the left or to the right of the mean. In this case the skew is positive (5.12), which means there is a longer "tail" on the right-hand side.
- kurtosis: a measure of the tailedness or peakedness of a distribution. In our case the value is strongly positive (38.75), which means heavier tails and a sharper peak compared to a normal distribution.
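The trimmed mean is the least familiar statistic in this table, so here is a tiny sketch of how it behaves. The vector records is made up to resemble pub.rec (mostly zeros, a few larger values), and trim = 0.1 is used because that is the default of describe() in psych.

```r
# Ten hypothetical borrowers: eight without records, one with 1, one with 5.
records <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 5)

mean(records)               # ordinary mean, pulled up by the value 5
# [1] 0.6

mean(records, trim = 0.1)   # drop the smallest and the largest 10% first
# [1] 0.125

median(records)
# [1] 0
```

With ten values, trimming 10% from each end removes one value from the bottom and one from the top, so the single borrower with 5 records no longer influences the result.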

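describeBy(data$dti, group = data$purposeF)   ###purposeF, because the original purpose column was deleted above (data$purpose only works through partial matching of $)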

## 
##  Descriptive statistics by group 
## group: credit_card
##    vars    n mean   sd median trimmed  mad min   max range  skew kurtosis   se
## X1    1 1262 14.1 6.47  14.38   14.21 7.46   0 29.95 29.95 -0.11     -0.8 0.18
## ------------------------------------------------------------ 
## group: debt_consolidation
##    vars    n  mean   sd median trimmed  mad min   max range  skew kurtosis  se
## X1    1 3957 14.08 6.43  14.24   14.18 7.35   0 29.96 29.96 -0.09    -0.77 0.1
## ------------------------------------------------------------ 
## group: educational
##    vars   n  mean   sd median trimmed  mad min   max range skew kurtosis   se
## X1    1 343 11.34 6.94  11.42   11.15 8.29   0 29.74 29.74 0.19    -0.87 0.37
## ------------------------------------------------------------ 
## group: major_purchase
##    vars   n  mean   sd median trimmed  mad min   max range skew kurtosis   se
## X1    1 437 10.16 6.63   9.51    9.91 7.78   0 26.15 26.15 0.26    -0.96 0.32
## ------------------------------------------------------------ 
## group: small_business
##    vars   n  mean   sd median trimmed  mad min   max range skew kurtosis   se
## X1    1 619 10.79 6.93  10.39   10.53 8.09   0 29.21 29.21 0.26    -0.92 0.28
## ------------------------------------------------------------ 
## group: all_other
##    vars    n  mean  sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 2331 11.08 7.1  10.56   10.84 8.51   0 29.9  29.9 0.23    -0.92 0.15
## ------------------------------------------------------------ 
## group: home_improvement
##    vars   n mean   sd median trimmed  mad min   max range skew kurtosis   se
## X1    1 629 10.2 6.78   9.66    9.82 8.04   0 28.17 28.17 0.38    -0.75 0.27

This function calculated the statistics for the variable "dti" for each category of the categorical variable "purpose" separately. The groups are printed one after another, which is the format the output has after knitting as well.

We can see that, for example, the most popular purpose is debt_consolidation, with 3957 observations in this category, followed by all_other and credit_card with 2331 and 1262 observations respectively. The least popular purpose, on the other hand, is educational, with only 343 observations.

The average debt-to-income ratio also varies by category. For credit_card, for example, the mean is 14.1, which is also the highest value among the groups; for those who borrow money for educational purposes it is 11.34. The mean has the lowest value for major_purchase and home_improvement, 10.16 and 10.20, respectively. The data set description does not state the scale of dti; since all values lie between 0 and 30, they are most plausibly percentages (debt of about 14.1% of annual income) rather than multiples of income.

The two highest median values are 14.38 and 14.24 for credit_card and debt_consolidation, respectively, which means that half of the people who borrow for credit cards have a debt-to-income ratio of 14.38 or less, and half of those who borrow for debt consolidation have a debt-to-income ratio of 14.24 or less. By "less" I mean that there is less debt relative to income.
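When only one or two statistics per group are needed, base R can produce them without psych. A minimal sketch with tapply() on the first six rows of the data (the data frame toy is built by hand here for illustration):

```r
# dti and purpose of the first six borrowers, as shown in head(data).
toy <- data.frame(
  dti = c(19.48, 14.29, 11.63, 8.10, 14.97, 16.98),
  purposeF = factor(c("debt_consolidation", "credit_card", "debt_consolidation",
                      "debt_consolidation", "credit_card", "credit_card"))
)

# Group sizes.
group_n <- table(toy$purposeF)

# Mean dti within each purpose category.
group_mean <- tapply(toy$dti, toy$purposeF, mean)
round(group_mean, 2)
#        credit_card debt_consolidation 
#              15.41              13.07 
```

With only six rows the group means are of course not comparable to the full-sample values above; the sketch only shows the mechanics of splitting a numeric variable by a factor.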

describeBy(data$revol.balN, group = data$credit.policy.F)

## 
##  Descriptive statistics by group 
## group: Fulfill
##    vars    n    mean       sd median trimmed     mad min    max  range skew
## X1    1 7710 13798.4 16878.56 8707.5 10514.1 9423.41   0 149527 149527 2.97
##    kurtosis     se
## X1    12.54 192.22
## ------------------------------------------------------------ 
## group: Oterwise
##    vars    n     mean       sd median  trimmed      mad min     max   range
## X1    1 1868 29773.15 66807.57 8039.5 14032.46 10343.36   0 1207359 1207359
##    skew kurtosis      se
## X1 6.64    79.16 1545.74

So here's another example of how to calculate statistics for each category separately. Here such statistics are calculated for the quantitative variable "revol.balN", which is the borrower's revolving balance (amount unpaid at the end of the credit card billing cycle). It is grouped by whether the customer meets the credit underwriting criteria of the platform or not.

Let's analyze here the values that I did not analyze in the previous example. The minimum and maximum amounts of outstanding debt for those who meet the criteria are $0 and $149,527, respectively, while for those who do not meet the criteria the minimum and maximum values are $0 and $1,207,359, respectively. Both skewness values are positive: 2.97 for those who meet the criteria and 6.64 for those who do not. Positive skew means that the right "tail" is longer. Kurtosis is positive in both cases, so both distributions have heavier tails and a sharper peak than a normal distribution. Moreover, for those who do not meet the criteria the value is larger (79.16 vs 12.54), that is, extreme balances are even more pronounced in that group.

hist(data$dti,   ###choose the variable
     main = "Distribution of Debt-to-Income Ratio",   ###add title 
     ylab = "Frequency",    ### add name for the y axis
     xlab = "Debt-to-Income Ratio",    ###add name for the x axis
     breaks = seq(from = 0, to = 30, by = 5))    ###so the data will range from 0 to 30 and the x-axis interval will be 5 units

A histogram visualizes the debt-to-income ratio data and shows how often values fall into each interval. On the y-axis we can see the frequency. Here we see that ratios between 10 and 15 occur most often, and ratios between 25 and 30 are the least frequent. The skewness reported by describe() above for dti is 0.02, so the distribution is close to symmetric, with only a very slight positive skew.
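The heights of the bars can also be obtained as numbers, which makes statements such as "10-15 occurs most often" checkable. A minimal sketch on the six dti values from head(data), using the same breaks as in the hist() call (the names dti, bins and h are illustrative):

```r
# dti of the first six borrowers.
dti <- c(19.48, 14.29, 11.63, 8.10, 14.97, 16.98)
breaks <- seq(from = 0, to = 30, by = 5)

# Assign every value to its interval and count per interval.
bins <- cut(dti, breaks = breaks, include.lowest = TRUE)
table(bins)
# bins
#   [0,5]  (5,10] (10,15] (15,20] (20,25] (25,30] 
#       0       1       3       2       0       0 

# hist() computes the same counts; plot = FALSE returns them without drawing.
h <- hist(dti, breaks = breaks, plot = FALSE)
h$counts
# [1] 0 1 3 2 0 0
```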

#install.packages("ggplot2")

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(data, aes(x = dti)) +   ###add dataframe and variable for x-axis
  geom_histogram(binwidth = 5,    ###the x-axis interval will be 5 units
                 boundary = 0,    ###start the bins at 0, so they match the breaks of hist() above
                 colour = "darkblue",    ###change the color of the column borders
                 fill = "skyblue") +   ###change the color of the columns
  labs(title = "Debt-to-Income Frequency",   ###add title
       x = "Debt-to-Income Ratio",   ###name x-axis
       y = "Frequency") +   ###name y-axis 
  theme_minimal()   ###change theme

In fact, here we see a histogram of the same variable. The difference is only in the creation tool (that is, the R function and the package it belongs to) and in the fact that this one has a slightly nicer appearance.

boxplot(data$revol.util)

Here we can see the boxplot for the revol.util variable. There is no scale on the x-axis, but the y-axis can be read as follows. The ends of the whiskers show the minimum (0) and maximum (119) values, so the range of the data can be read from them. This works here because no observation lies further than 1.5 interquartile ranges from the box; otherwise such points would be drawn separately as outliers. The thick line represents the median, not the mean (the mean in this case is 46.80). Let me remind you that this variable is the amount of the credit line used relative to the total credit available, that is, a share. The edges of the gray box show the first and third quartiles, which are 22.6 and 70.9, respectively.
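The numbers that a boxplot draws can also be obtained directly with `boxplot.stats()`. The sketch below uses simulated values (`util_toy` is a made-up vector, not the real `revol.util` column) to show that the thick line corresponds to the median and to compare it with the mean.

```r
# Made-up utilization values, standing in for data$revol.util
set.seed(3)
util_toy <- c(runif(200, min = 0, max = 100), 119)

# $stats holds the five values drawn by boxplot():
# lower whisker, lower hinge, median, upper hinge, upper whisker
s <- boxplot.stats(util_toy)$stats
names(s) <- c("lower_whisker", "lower_hinge", "median", "upper_hinge", "upper_whisker")
print(s)

# The mean is a different number and is not drawn by boxplot()
print(mean(util_toy))
```

The hinges are very close to the first and third quartiles, so for practical reading of the plot they can be treated as such.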

library(ggplot2)

ggplot(data, aes(x = revol.util)) +   ###add dataframe and variable for x-axis
  geom_boxplot(colour = "darkred",    ###change the color of the box borders
               fill = "pink") +   ###change the color of the box
  labs(title = "Revolving Line Utilization Rate",   ###add title
       x = "Revolving line utilization rate",   ###name x-axis (the values are on this axis)
       y = NULL) +   ###no name for the y-axis, it has no meaning for a single boxplot
  theme_classic()    ###change theme

Here is actually the same boxplot for revol.util. The differences are the creation tool (that is, the R function and the package it belongs to), the nicer appearance, and the fact that here the values are located along the x-axis, so the graph itself is horizontal, not vertical.

ggplot(data, aes(x = dti,   ###add variable
                 fill = purposeF)) +   ###group by categories
  geom_boxplot() +   ###type of the graph
  xlab("Debt-to-Income Ratio") +   ###x-axis name
  labs(fill = "Purpose of Credit",   ###add legend
       title = "Debt-to-Income based on Purpose")   ###add title

Such a graph allows us to visually compare how the debt-to-income ratio differs depending on the purpose of the loan. For example, we see that the median ratio is noticeably higher for people whose purpose is credit card or debt consolidation. The range for those who borrow for a major purchase is the smallest, and its maximum ratio is no more than 26. Although the median is approximately the same for small business and for all other purposes, the third quartile (the upper edge of the box) is lower for small business, which means that 75% of small business borrowers stay below a lower ratio.
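The quartiles that the grouped boxplots show can also be computed as numbers, which makes the comparison between purposes more precise than reading it off the plot. This is a sketch on simulated data; the data frame `toy` and its values are made up and only the category names resemble the real ones.

```r
# Made-up data: three loan purposes with different ranges of the ratio
set.seed(4)
toy <- data.frame(
  purpose = rep(c("credit_card", "major_purchase", "small_business"), each = 100),
  dti     = c(runif(100, 0, 30), runif(100, 0, 26), runif(100, 0, 30))
)

# Split the ratio by purpose and compute the quartiles within each group
quartiles <- sapply(split(toy$dti, toy$purpose),
                    quantile, probs = c(0.25, 0.5, 0.75))
print(round(quartiles, 2))
```

Each column is one category: the "50%" row is the line inside the box, and the "25%" and "75%" rows are approximately the edges of the box.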

library(ggplot2)

ggplot(data, aes(x = int.rate, y = log.annual.inc)) +   ###add variables for x- and y-axes
  geom_point(shape = 21,    ###change shape of the points
             size = 3,    ###change size of the points
             color = "white",    ###change the color of the point borders
             fill = "wheat1") +    ###change the color of the points
  labs(title = "Income-to-Interest Rate Relation",    ###add title
       x = "Interest Rate",   ###add name to the x-axis
       y = "Annual Income in log-points") +   ###add name to the y-axis
  theme_dark()    ###change theme

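With this scatterplot we can investigate the relationship between the interest rate and the borrower’s income: each point is one loan, and a visible upward or downward pattern in the cloud of points would indicate a relationship.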
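A scatterplot shows the relationship visually; a correlation coefficient summarizes it in one number. Below is a minimal sketch on simulated data (the vectors `int_rate_toy` and `log_inc_toy` are made up and generated independently of each other, so they say nothing about the real loan data).

```r
# Made-up values in roughly the same ranges as the real variables
set.seed(5)
n <- 300
int_rate_toy <- runif(n, min = 0.06, max = 0.21)
log_inc_toy  <- rnorm(n, mean = 11, sd = 0.6)

# Pearson correlation: values near 0 mean no linear relationship,
# values near -1 or 1 mean a strong linear relationship
r <- cor(int_rate_toy, log_inc_toy)
print(r)
```

On the real data the same call would be `cor(data$int.rate, data$log.annual.inc)`.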

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplot(int.rate ~ revol.balN | not.fully.paidF,    ###add variables: int.rate to y-axis, revol.balN to x-axis, group by not.fully.paidF
            smooth = FALSE,    ###do not draw the smoothed curve, only the straight trend lines
            xlab = "Balance",    ###add x-axis name
            ylab = "Interest rate",    ###add y-axis name
            main = "Scatterplot",    ###add title
            data = data,   ###choose dataframe
            cex = 0.5,    ###change points size
            bg = "lightyellow")

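In addition to showing the relationship between two numerical variables, this graph also allows us to compare how this relationship differs based on the categorical variable “not.fully.paid”. We can see that the trend line is steeper for those who have paid in full, which means that for them the interest rate increases faster as the balance grows. Note that a steeper line describes a larger slope, not necessarily a stronger correlation; the strength of the correlation depends on how tightly the points are scattered around the line. For both categories the relationship is positive, because the trend lines “go up”.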
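To show the difference between the slope of a trend line and the strength of the correlation, here is a small sketch on simulated data. The data frame `toy` and the groups "A" and "B" are made up: group A is built with a steep slope and a lot of noise, group B with a shallow slope and very little noise.

```r
set.seed(6)
n <- 200
x <- runif(n, min = 0, max = 100)

toy <- data.frame(
  x     = c(x, x),
  group = rep(c("A", "B"), each = n),
  y     = c(1.0 * x + rnorm(n, sd = 60),   # group A: steep slope, noisy
            0.1 * x + rnorm(n, sd = 1))    # group B: shallow slope, tight
)

# For each group: slope of the fitted line and Pearson correlation
by_group <- sapply(split(toy, toy$group), function(d) {
  c(slope       = unname(coef(lm(y ~ x, data = d))["x"]),
    correlation = cor(d$x, d$y))
})
print(round(by_group, 3))
```

In this construction group A has the steeper line but the weaker correlation, so the two numbers should be reported separately for each category.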

2024-04-05