Intro

We will create a regression model using “credit.csv” dataset. We would like to learn the relationship among variables with the target variables of “amount” which is the loan amount. We also wanted to predict the newly disbursed loan amount based on the available historical data of “credit.csv” dataframe.

Data Preparation

-> Load the required package:

library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)

options(scipen = 100, max.print = 1e+06)

-> Load the dataset and make sure to use ‘stringsAsFactors = TRUE’ to change string/character as factor to ensure faster modelling.

credit <- read.csv('data_input/credit.csv',stringsAsFactors = TRUE)

-> Check the data structure and see whether there are variables that have wrong data type

str(credit)
#> 'data.frame':    1000 obs. of  17 variables:
#>  $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
#>  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
#>  $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
#>  $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
#>  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
#>  $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
#>  $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
#>  $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
#>  $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
#>  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
#>  $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
#>  $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
#>  $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
#>  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
#>  $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
#>  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...

From the data structure we can see that no data type are deemed not appropriate. The data has 1000 rows/obs and 17 columns/variables. The target variable is the amount, which signifies the loan amount approved and disbursed by the Bank.

-> To check if there’s any missing row data

is.na(credit) %>% colSums()
#>     checking_balance months_loan_duration       credit_history 
#>                    0                    0                    0 
#>              purpose               amount      savings_balance 
#>                    0                    0                    0 
#>  employment_duration    percent_of_income   years_at_residence 
#>                    0                    0                    0 
#>                  age         other_credit              housing 
#>                    0                    0                    0 
#> existing_loans_count                  job           dependents 
#>                    0                    0                    0 
#>                phone              default 
#>                    0                    0

We can see that there are no data that are missing and the dataframe is ready for modelling.

Exploratory Data Analysis

Exploratory data analysis is a phase where we explore the correlation of numerical data variables.

Calculate Pearson correlation between numerical variables.

ggcorr(credit, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

-> Conclusion: - If the target variable is ‘amount’ then the only variable that have correlation is ‘months_loan_duration’ which have positive correlation and ‘percent_of_income’ that have negative correlation on ‘amount’ as the target variable.

Modelling

Train-Test Split

Before modelling process, we have to split the data into train dataset and test dataset. We will use the train dataset to train the model and the test dataset will be used as a comparasion and see if the model is overfit/underfit and fail to predict new data that hasn’t been seen during training phase. We will 80% of the data as the training data and the rest of it as the testing data.

set.seed(123)
samplesize <- round(0.8 * nrow(credit), 0)
index <- sample(seq_len(nrow(credit)), size = samplesize)

data_train <- credit[index, ]
data_test <- credit[-index, ]

Linear Regression

set.seed(123)
credit_lm <- lm(amount ~ ., data = data_train)

summary(credit_lm)
#> 
#> Call:
#> lm(formula = amount ~ ., data = data_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5538.5 -1092.2  -175.1   669.9 10716.7 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value
#> (Intercept)                     4301.915    734.301   5.859
#> checking_balance> 200 DM        -353.572    298.107  -1.186
#> checking_balance1 - 200 DM       345.765    189.668   1.823
#> checking_balanceunknown          267.788    184.278   1.453
#> months_loan_duration             128.444      6.092  21.084
#> credit_historygood              -218.539    194.375  -1.124
#> credit_historyperfect            454.928    397.351   1.145
#> credit_historypoor               240.899    277.753   0.867
#> credit_historyvery good         -338.047    368.942  -0.916
#> purposecar                       228.957    258.924   0.884
#> purposecar0                     1665.503    682.902   2.439
#> purposeeducation                -107.377    365.882  -0.293
#> purposefurniture/appliances     -259.710    250.336  -1.037
#> purposerenovations              -158.568    530.335  -0.299
#> savings_balance> 1000 DM          54.689    320.264   0.171
#> savings_balance100 - 500 DM     -255.429    231.792  -1.102
#> savings_balance500 - 1000 DM    -404.899    285.296  -1.419
#> savings_balanceunknown           521.820    190.624   2.737
#> employment_duration> 7 years    -186.251    239.359  -0.778
#> employment_duration1 - 4 years  -114.498    203.016  -0.564
#> employment_duration4 - 7 years   175.216    235.859   0.743
#> employment_durationunemployed    132.752    362.411   0.366
#> percent_of_income               -794.834     62.511 -12.715
#> years_at_residence                -6.375     69.115  -0.092
#> age                                5.024      6.855   0.733
#> other_creditnone                 187.375    200.284   0.936
#> other_creditstore                297.015    364.689   0.814
#> housingown                      -570.186    236.301  -2.413
#> housingrent                     -637.314    278.514  -2.288
#> existing_loans_count             -53.091    152.517  -0.348
#> jobskilled                     -1391.930    225.871  -6.163
#> jobunemployed                  -1832.279    523.735  -3.498
#> jobunskilled                   -1408.703    275.881  -5.106
#> dependents                      -211.438    193.136  -1.095
#> phoneyes                         517.161    152.652   3.388
#> defaultyes                       409.485    166.283   2.463
#>                                            Pr(>|t|)    
#> (Intercept)                           0.00000000694 ***
#> checking_balance> 200 DM                   0.235968    
#> checking_balance1 - 200 DM                 0.068693 .  
#> checking_balanceunknown                    0.146585    
#> months_loan_duration           < 0.0000000000000002 ***
#> credit_historygood                         0.261233    
#> credit_historyperfect                      0.252608    
#> credit_historypoor                         0.386044    
#> credit_historyvery good                    0.359819    
#> purposecar                                 0.376834    
#> purposecar0                                0.014960 *  
#> purposeeducation                           0.769240    
#> purposefurniture/appliances                0.299857    
#> purposerenovations                         0.765024    
#> savings_balance> 1000 DM                   0.864455    
#> savings_balance100 - 500 DM                0.270820    
#> savings_balance500 - 1000 DM               0.156242    
#> savings_balanceunknown                     0.006336 ** 
#> employment_duration> 7 years               0.436737    
#> employment_duration1 - 4 years             0.572929    
#> employment_duration4 - 7 years             0.457782    
#> employment_durationunemployed              0.714241    
#> percent_of_income              < 0.0000000000000002 ***
#> years_at_residence                         0.926532    
#> age                                        0.463831    
#> other_creditnone                           0.349802    
#> other_creditstore                          0.415651    
#> housingown                                 0.016058 *  
#> housingrent                                0.022394 *  
#> existing_loans_count                       0.727861    
#> jobskilled                            0.00000000116 ***
#> jobunemployed                              0.000495 ***
#> jobunskilled                          0.00000041540 ***
#> dependents                                 0.273966    
#> phoneyes                                   0.000741 ***
#> defaultyes                                 0.014014 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1875 on 764 degrees of freedom
#> Multiple R-squared:  0.5862, Adjusted R-squared:  0.5672 
#> F-statistic: 30.92 on 35 and 764 DF,  p-value: < 0.00000000000000022
summary(credit_lm)$coefficients
#>                                    Estimate Std. Error      t value
#> (Intercept)                     4301.914919 734.301230   5.85851521
#> checking_balance> 200 DM        -353.571665 298.106631  -1.18605770
#> checking_balance1 - 200 DM       345.764781 189.667510   1.82300481
#> checking_balanceunknown          267.787941 184.277519   1.45317748
#> months_loan_duration             128.443937   6.091947  21.08421722
#> credit_historygood              -218.538770 194.375158  -1.12431430
#> credit_historyperfect            454.927946 397.350631   1.14490304
#> credit_historypoor               240.898652 277.753405   0.86731125
#> credit_historyvery good         -338.046944 368.941890  -0.91626067
#> purposecar                       228.956516 258.924405   0.88426008
#> purposecar0                     1665.502953 682.902023   2.43886077
#> purposeeducation                -107.376634 365.882452  -0.29347304
#> purposefurniture/appliances     -259.710244 250.336402  -1.03744498
#> purposerenovations              -158.568107 530.335368  -0.29899591
#> savings_balance> 1000 DM          54.689447 320.264381   0.17076344
#> savings_balance100 - 500 DM     -255.428754 231.792080  -1.10197360
#> savings_balance500 - 1000 DM    -404.899320 285.296380  -1.41922347
#> savings_balanceunknown           521.820086 190.624372   2.73742587
#> employment_duration> 7 years    -186.250739 239.358956  -0.77812313
#> employment_duration1 - 4 years  -114.498338 203.016291  -0.56398596
#> employment_duration4 - 7 years   175.215510 235.859376   0.74288126
#> employment_durationunemployed    132.752173 362.411006   0.36630282
#> percent_of_income               -794.834227  62.510716 -12.71516759
#> years_at_residence                -6.375166  69.115085  -0.09223987
#> age                                5.024025   6.854791   0.73292169
#> other_creditnone                 187.374599 200.283600   0.93554639
#> other_creditstore                297.014773 364.689347   0.81443227
#> housingown                      -570.185717 236.300557  -2.41296815
#> housingrent                     -637.314122 278.513888  -2.28826694
#> existing_loans_count             -53.091019 152.516872  -0.34809932
#> jobskilled                     -1391.930229 225.870875  -6.16250426
#> jobunemployed                  -1832.279355 523.734895  -3.49848630
#> jobunskilled                   -1408.702822 275.881198  -5.10619366
#> dependents                      -211.438114 193.136275  -1.09476127
#> phoneyes                         517.160900 152.651925   3.38784395
#> defaultyes                       409.485305 166.282896   2.46258223
#>                                                                                                              Pr(>|t|)
#> (Intercept)                    0.000000006939458025623265839936654186814735112420748919248580932617187500000000000000
#> checking_balance> 200 DM       0.235968236551983345794525348537717945873737335205078125000000000000000000000000000000
#> checking_balance1 - 200 DM     0.068693333574170192812324842179805273190140724182128906250000000000000000000000000000
#> checking_balanceunknown        0.146585089037442273252054292242974042892456054687500000000000000000000000000000000000
#> months_loan_duration           0.000000000000000000000000000000000000000000000000000000000000000000000000000003924887
#> credit_historygood             0.261232835320175160731537289393600076436996459960937500000000000000000000000000000000
#> credit_historyperfect          0.252607855827542193694057459651958197355270385742187500000000000000000000000000000000
#> credit_historypoor             0.386043817439114866907345913205062970519065856933593750000000000000000000000000000000
#> credit_historyvery good        0.359819308659901526681323957745917141437530517578125000000000000000000000000000000000
#> purposecar                     0.376834016879434297031536971189780160784721374511718750000000000000000000000000000000
#> purposecar0                    0.014960330531394868194827729723783704685047268867492675781250000000000000000000000000
#> purposeeducation               0.769240307074732343473044693382689729332923889160156250000000000000000000000000000000
#> purposefurniture/appliances    0.299856754947839199143544419712270610034465789794921875000000000000000000000000000000
#> purposerenovations             0.765024475096179989996869608148699626326560974121093750000000000000000000000000000000
#> savings_balance> 1000 DM       0.864454991925646121231352481117937713861465454101562500000000000000000000000000000000
#> savings_balance100 - 500 DM    0.270820172547056647882612878674990497529506683349609375000000000000000000000000000000
#> savings_balance500 - 1000 DM   0.156241722975992480826334940502420067787170410156250000000000000000000000000000000000
#> savings_balanceunknown         0.006336194577355417668906856221155976527370512485504150390625000000000000000000000000
#> employment_duration> 7 years   0.436737334872707494426435914647299796342849731445312500000000000000000000000000000000
#> employment_duration1 - 4 years 0.572929200325231091639466285414528101682662963867187500000000000000000000000000000000
#> employment_duration4 - 7 years 0.457781927357519768939653204142814502120018005371093750000000000000000000000000000000
#> employment_durationunemployed  0.714240538414836478864344826433807611465454101562500000000000000000000000000000000000
#> percent_of_income              0.000000000000000000000000000000000979342915146096460767000113634139779605902731418610
#> years_at_residence             0.926531642894123597109512502356665208935737609863281250000000000000000000000000000000
#> age                            0.463831002770212275088113074161810800433158874511718750000000000000000000000000000000
#> other_creditnone               0.349802397489268446229004894121317192912101745605468750000000000000000000000000000000
#> other_creditstore              0.415651136680942334677979488333221524953842163085937500000000000000000000000000000000
#> housingown                     0.016057663386048567910613371623185230419039726257324218750000000000000000000000000000
#> housingrent                    0.022394241225180615967316199999004311393946409225463867187500000000000000000000000000
#> existing_loans_count           0.727861487334296564100100113137159496545791625976562500000000000000000000000000000000
#> jobskilled                     0.000000001158131370407223614360341101825468967945198528468608856201171875000000000000
#> jobunemployed                  0.000494983627366074618349200964217970977188087999820709228515625000000000000000000000
#> jobunskilled                   0.000000415395249908262892673399546694668060808908194303512573242187500000000000000000
#> dependents                     0.273966249915296444772394579558749683201313018798828125000000000000000000000000000000
#> phoneyes                       0.000740521458109598277111396136263010703260079026222229003906250000000000000000000000
#> defaultyes                     0.014013668520204447184851481722489552339538931846618652343750000000000000000000000000

-> Conclusion: From the first lm model, we can see that the adjusted R-squared are quite low (below 80%) and there are a lot of variables that don’t have significance level (indicated by the number of Pr(>|t|) that are above 0.05 - the lower are the more significant). Thus, we can made a simpler model by removing variables that has p-value > 0.05, since they don’t have significant effect toward our model.

Variable to be considered based on significance level: - months_loan_duration - percent_of_income - job - phone - default

We redo the linier regression modelling using the selected variables:

credit2 <- credit %>% select(months_loan_duration,percent_of_income,job,phone,default,amount )

data_train2 <- credit2[index, ]
data_test2 <- credit2[-index, ]

set.seed(123)
credit_lm2 <- lm(amount ~ ., data = data_train2)

summary(credit_lm2)
#> 
#> Call:
#> lm(formula = amount ~ ., data = data_train2)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5223.6 -1067.5  -155.5   658.1 11339.5 
#> 
#> Coefficients:
#>                       Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)           3869.505    312.701  12.374 < 0.0000000000000002 ***
#> months_loan_duration   137.348      5.834  23.544 < 0.0000000000000002 ***
#> percent_of_income     -816.179     60.799 -13.424 < 0.0000000000000002 ***
#> jobskilled           -1619.202    208.366  -7.771   0.0000000000000242 ***
#> jobunemployed        -1621.364    495.775  -3.270              0.00112 ** 
#> jobunskilled         -1653.229    258.321  -6.400   0.0000000002659802 ***
#> phoneyes               647.825    151.359   4.280   0.0000209740597980 ***
#> defaultyes             311.598    153.555   2.029              0.04277 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1914 on 792 degrees of freedom
#> Multiple R-squared:  0.553,  Adjusted R-squared:  0.549 
#> F-statistic: 139.9 on 7 and 792 DF,  p-value: < 0.00000000000000022

-> the comparison of the both adjusted R square of the two models are as follows: - credit_lm: Adjusted R-squared: 0.5672 - credit_lm2: Adjusted R-squared: 0.549

Conclusion: we can see from the comparison that the difference between the two models are not significant thus using credit_lm2 with simpler model is more acceptable due to with the simplicity of the model but can achieve similar result.

Model Evaluation

Model Performance

The performance of our model can be evaluate using RMSE (Root Mean Squared Error)

\[ RMSE = \sqrt{\frac{ \sum (\hat y - y)^2} {n}} \] We can use the RMSE () functions from caret package. Below is the first model (with complete variables) performance.

-> Performance of Model credit_lm

lm_pred <- predict(credit_lm, newdata = data_test %>% select(-amount))

# RMSE of train dataset
RMSE(pred = credit_lm$fitted.values, obs = data_train$amount)
#> [1] 1832.503
# RMSE of test dataset
RMSE(pred = lm_pred, obs = data_test$amount)
#> [1] 1753.368

-> Performance of Model credit_lm2

lm_pred2 <- predict(credit_lm2, newdata = data_test2 %>% select(-amount))

# RMSE of train dataset
RMSE(pred = credit_lm2$fitted.values, obs = data_train2$amount)
#> [1] 1904.685
# RMSE of test dataset
RMSE(pred = lm_pred2, obs = data_test2$amount)
#> [1] 1719.448

Performance Conclusion:

To determine which model is better, let’s evaluate both based on the Root Mean Squared Error (RMSE) values for both the training and testing datasets. RMSE provides an indication of the model’s prediction error, where a lower value generally indicates better performance. However, we need to look at both training and test RMSE to evaluate overfitting and generalization ability.

Here are the RMSE values:

Model: credit_lm RMSE (Train): 1832.503 RMSE (Test): 1753.368

Model: credit_lm2 RMSE (Train): 1904.685 RMSE (Test): 1719.448

Comparison and Interpretation:

Generalization Performance (Test RMSE): For credit_lm, the RMSE on the test data is 1753.368. For credit_lm2, the RMSE on the test data is 1719.448. Model credit_lm2 has a lower test RMSE compared to credit_lm, indicating it performs slightly better on unseen data.

Training Performance (Train RMSE): For credit_lm, the RMSE on the training data is 1832.503. For credit_lm2, the RMSE on the training data is 1904.685. credit_lm has a lower training RMSE, meaning it fits the training data better than credit_lm2. However, this does not necessarily mean it generalizes well to new data.

Overfitting Evaluation: The gap between train and test RMSE gives an indication of overfitting: credit_lm has a higher gap between training (1832.503) and test (1753.368) RMSE values. credit_lm2 has a smaller gap between training (1904.685) and test (1719.448) RMSE values. credit_lm2 appears to generalize better since the test performance is close to its training performance, while credit_lm has a larger gap, suggesting it may be slightly overfitting to the training data.

Conclusion: credit_lm2 is likely the better model because it has a lower test RMSE (1719.448) compared to credit_lm (1753.368), which means it performs better on new, unseen data. Additionally, the smaller gap between training and test RMSE for credit_lm2 indicates it generalizes better and is less prone to overfitting than credit_lm.

Thus, based on the RMSE values, credit_lm2 is a better model overall due to its lower test error and better generalization capability.

Model Assumptions

  1. Linearity
resact <- data.frame(residual = credit_lm2$residuals, fitted = credit_lm2$fitted.values)

resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())

Residuals are around value of 0.

Conclusion: Due to the residuals are around value of 0, thus linearity assumption is fulfilled.

Status: Linearity assumption [Fulfilled].

Additional test for linearity

hist(credit_lm2$residuals)

Linear regression model is expected to have error that are normally distributed. the above chart displayed that the error are focussed on value of 0 thus linearity assumption are fulfilled.

  1. Normality
shapiro.test(credit_lm2$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  credit_lm2$residuals
#> W = 0.89477, p-value < 0.00000000000000022

Conclusion: the null hypothesis is that the residuals follow normal distribution. With p-value < 0.05, we can conclude that our hypothesis is rejected, and our residuals are not following the normal distribution.

Status: Normality assumption [Not Fulfilled].

  1. Autocorrelation
durbinWatsonTest(credit_lm2)
#>  lag Autocorrelation D-W Statistic p-value
#>    1     0.001180496      1.997127   0.912
#>  Alternative hypothesis: rho != 0

Conclusion: The result shows that the null hypothesis is accepted, meaning that our residual has no autocorrelation.

Status: No autocorrelation assumption [Fulfilled].

  1. Homoscedasticity of Residuals (No Heterocedasticity)
bptest(credit_lm2)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  credit_lm2
#> BP = 138.57, df = 7, p-value < 0.00000000000000022
resact %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0))

Conclusion: We can observe that on lower fitted values, the residuals are concentrated around the value of 0. As the fitted value increases, the residuals are also got bigger. Second way to detect heterocesdasticity is using the Breusch-Pagan test, with null hypothesis is there is no heterocesdasticity. With p-value < 0.05, we can conclude that heterocesdasticity is present in our model.

Status: Homoscedasticity of Residuals assumption [Not Fulfilled].

  1. Multicollinearity
vif(credit_lm2)
#>                          GVIF Df GVIF^(1/(2*Df))
#> months_loan_duration 1.112548  1        1.054774
#> percent_of_income    1.027793  1        1.013801
#> job                  1.255754  3        1.038686
#> phone                1.210974  1        1.100443
#> default              1.054460  1        1.026869

Conclusion:Multicollinearity mean that there is a correlation between the independent variables/predictors. To check the multicollinearity, we can measure the varianec inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity. Thus no multicollinearity detected.

Status: No Multicollinearity assumption [Fulfilled].

Final Conclusion of Assumption From 5 assumption tests, it is concluded that: 1. Linearity -> [Fulfilled] 2. Normality of Residuals -> [Not Fulfilled] 3. No Autocorrelation -> [Fulfilled] 4. Homoscedasticity of Residuals -> [Not Fulfilled] 5. No Multicollinearity -> [Fulfilled]

Based on assumption test, credit_lm2 model is able to fulfill 3 from 5 total test which is linearity, no autocorrelation and no multicollinearity. But from 5 assumption, the two that is mandatory to be fulfilled are linearity and no multicollinearity. Thus the model is usable.

Conclusion

Variables that are useful to describe the variances in credit amount are months_loan_duration, percent_of_income, job, phone, default. Our final model has satisfied the classical assumptions. The R-squared of the model is sufficient, with 54.9% of the variables can explain the variances in the credit amount. The accuracy of the model in predicting the car price is measured with RMSE, in which that our model may slightly overfit the traning dataset.

We have already learn how to build a linear regression model and what need to be concerned when building the model.