We will create a regression model using “credit.csv” dataset. We would like to learn the relationship among variables with the target variables of “amount” which is the loan amount. We also wanted to predict the newly disbursed loan amount based on the available historical data of “credit.csv” dataframe.
-> Load the required package:
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)
options(scipen = 100, max.print = 1e+06)-> Load the dataset and make sure to use ‘stringsAsFactors = TRUE’ to change string/character as factor to ensure faster modelling.
-> Check the data structure and see whether there are variables that have wrong data type
#> 'data.frame': 1000 obs. of 17 variables:
#> $ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
#> $ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
#> $ credit_history : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
#> $ purpose : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
#> $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
#> $ savings_balance : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
#> $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
#> $ percent_of_income : int 4 2 2 2 3 2 3 2 2 4 ...
#> $ years_at_residence : int 4 2 3 4 4 4 4 2 4 2 ...
#> $ age : int 67 22 49 45 53 35 53 35 61 28 ...
#> $ other_credit : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ housing : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
#> $ existing_loans_count: int 2 1 1 1 2 1 1 1 1 2 ...
#> $ job : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
#> $ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
#> $ phone : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
#> $ default : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
From the data structure we can see that no data type are deemed not appropriate. The data has 1000 rows/obs and 17 columns/variables. The target variable is the amount, which signifies the loan amount approved and disbursed by the Bank.
-> To check if there’s any missing row data
#> checking_balance months_loan_duration credit_history
#> 0 0 0
#> purpose amount savings_balance
#> 0 0 0
#> employment_duration percent_of_income years_at_residence
#> 0 0 0
#> age other_credit housing
#> 0 0 0
#> existing_loans_count job dependents
#> 0 0 0
#> phone default
#> 0 0
We can see that there are no data that are missing and the dataframe is ready for modelling.
Exploratory data analysis is a phase where we explore the correlation of numerical data variables.
Calculate Pearson correlation between numerical variables.
-> Conclusion: - If the target variable is ‘amount’ then the only
variable that have correlation is ‘months_loan_duration’ which have
positive correlation and ‘percent_of_income’ that have negative
correlation on ‘amount’ as the target variable.
Before modelling process, we have to split the data into train dataset and test dataset. We will use the train dataset to train the model and the test dataset will be used as a comparasion and see if the model is overfit/underfit and fail to predict new data that hasn’t been seen during training phase. We will 80% of the data as the training data and the rest of it as the testing data.
#>
#> Call:
#> lm(formula = amount ~ ., data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5538.5 -1092.2 -175.1 669.9 10716.7
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 4301.915 734.301 5.859
#> checking_balance> 200 DM -353.572 298.107 -1.186
#> checking_balance1 - 200 DM 345.765 189.668 1.823
#> checking_balanceunknown 267.788 184.278 1.453
#> months_loan_duration 128.444 6.092 21.084
#> credit_historygood -218.539 194.375 -1.124
#> credit_historyperfect 454.928 397.351 1.145
#> credit_historypoor 240.899 277.753 0.867
#> credit_historyvery good -338.047 368.942 -0.916
#> purposecar 228.957 258.924 0.884
#> purposecar0 1665.503 682.902 2.439
#> purposeeducation -107.377 365.882 -0.293
#> purposefurniture/appliances -259.710 250.336 -1.037
#> purposerenovations -158.568 530.335 -0.299
#> savings_balance> 1000 DM 54.689 320.264 0.171
#> savings_balance100 - 500 DM -255.429 231.792 -1.102
#> savings_balance500 - 1000 DM -404.899 285.296 -1.419
#> savings_balanceunknown 521.820 190.624 2.737
#> employment_duration> 7 years -186.251 239.359 -0.778
#> employment_duration1 - 4 years -114.498 203.016 -0.564
#> employment_duration4 - 7 years 175.216 235.859 0.743
#> employment_durationunemployed 132.752 362.411 0.366
#> percent_of_income -794.834 62.511 -12.715
#> years_at_residence -6.375 69.115 -0.092
#> age 5.024 6.855 0.733
#> other_creditnone 187.375 200.284 0.936
#> other_creditstore 297.015 364.689 0.814
#> housingown -570.186 236.301 -2.413
#> housingrent -637.314 278.514 -2.288
#> existing_loans_count -53.091 152.517 -0.348
#> jobskilled -1391.930 225.871 -6.163
#> jobunemployed -1832.279 523.735 -3.498
#> jobunskilled -1408.703 275.881 -5.106
#> dependents -211.438 193.136 -1.095
#> phoneyes 517.161 152.652 3.388
#> defaultyes 409.485 166.283 2.463
#> Pr(>|t|)
#> (Intercept) 0.00000000694 ***
#> checking_balance> 200 DM 0.235968
#> checking_balance1 - 200 DM 0.068693 .
#> checking_balanceunknown 0.146585
#> months_loan_duration < 0.0000000000000002 ***
#> credit_historygood 0.261233
#> credit_historyperfect 0.252608
#> credit_historypoor 0.386044
#> credit_historyvery good 0.359819
#> purposecar 0.376834
#> purposecar0 0.014960 *
#> purposeeducation 0.769240
#> purposefurniture/appliances 0.299857
#> purposerenovations 0.765024
#> savings_balance> 1000 DM 0.864455
#> savings_balance100 - 500 DM 0.270820
#> savings_balance500 - 1000 DM 0.156242
#> savings_balanceunknown 0.006336 **
#> employment_duration> 7 years 0.436737
#> employment_duration1 - 4 years 0.572929
#> employment_duration4 - 7 years 0.457782
#> employment_durationunemployed 0.714241
#> percent_of_income < 0.0000000000000002 ***
#> years_at_residence 0.926532
#> age 0.463831
#> other_creditnone 0.349802
#> other_creditstore 0.415651
#> housingown 0.016058 *
#> housingrent 0.022394 *
#> existing_loans_count 0.727861
#> jobskilled 0.00000000116 ***
#> jobunemployed 0.000495 ***
#> jobunskilled 0.00000041540 ***
#> dependents 0.273966
#> phoneyes 0.000741 ***
#> defaultyes 0.014014 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1875 on 764 degrees of freedom
#> Multiple R-squared: 0.5862, Adjusted R-squared: 0.5672
#> F-statistic: 30.92 on 35 and 764 DF, p-value: < 0.00000000000000022
#> Estimate Std. Error t value
#> (Intercept) 4301.914919 734.301230 5.85851521
#> checking_balance> 200 DM -353.571665 298.106631 -1.18605770
#> checking_balance1 - 200 DM 345.764781 189.667510 1.82300481
#> checking_balanceunknown 267.787941 184.277519 1.45317748
#> months_loan_duration 128.443937 6.091947 21.08421722
#> credit_historygood -218.538770 194.375158 -1.12431430
#> credit_historyperfect 454.927946 397.350631 1.14490304
#> credit_historypoor 240.898652 277.753405 0.86731125
#> credit_historyvery good -338.046944 368.941890 -0.91626067
#> purposecar 228.956516 258.924405 0.88426008
#> purposecar0 1665.502953 682.902023 2.43886077
#> purposeeducation -107.376634 365.882452 -0.29347304
#> purposefurniture/appliances -259.710244 250.336402 -1.03744498
#> purposerenovations -158.568107 530.335368 -0.29899591
#> savings_balance> 1000 DM 54.689447 320.264381 0.17076344
#> savings_balance100 - 500 DM -255.428754 231.792080 -1.10197360
#> savings_balance500 - 1000 DM -404.899320 285.296380 -1.41922347
#> savings_balanceunknown 521.820086 190.624372 2.73742587
#> employment_duration> 7 years -186.250739 239.358956 -0.77812313
#> employment_duration1 - 4 years -114.498338 203.016291 -0.56398596
#> employment_duration4 - 7 years 175.215510 235.859376 0.74288126
#> employment_durationunemployed 132.752173 362.411006 0.36630282
#> percent_of_income -794.834227 62.510716 -12.71516759
#> years_at_residence -6.375166 69.115085 -0.09223987
#> age 5.024025 6.854791 0.73292169
#> other_creditnone 187.374599 200.283600 0.93554639
#> other_creditstore 297.014773 364.689347 0.81443227
#> housingown -570.185717 236.300557 -2.41296815
#> housingrent -637.314122 278.513888 -2.28826694
#> existing_loans_count -53.091019 152.516872 -0.34809932
#> jobskilled -1391.930229 225.870875 -6.16250426
#> jobunemployed -1832.279355 523.734895 -3.49848630
#> jobunskilled -1408.702822 275.881198 -5.10619366
#> dependents -211.438114 193.136275 -1.09476127
#> phoneyes 517.160900 152.651925 3.38784395
#> defaultyes 409.485305 166.282896 2.46258223
#> Pr(>|t|)
#> (Intercept) 0.000000006939458025623265839936654186814735112420748919248580932617187500000000000000
#> checking_balance> 200 DM 0.235968236551983345794525348537717945873737335205078125000000000000000000000000000000
#> checking_balance1 - 200 DM 0.068693333574170192812324842179805273190140724182128906250000000000000000000000000000
#> checking_balanceunknown 0.146585089037442273252054292242974042892456054687500000000000000000000000000000000000
#> months_loan_duration 0.000000000000000000000000000000000000000000000000000000000000000000000000000003924887
#> credit_historygood 0.261232835320175160731537289393600076436996459960937500000000000000000000000000000000
#> credit_historyperfect 0.252607855827542193694057459651958197355270385742187500000000000000000000000000000000
#> credit_historypoor 0.386043817439114866907345913205062970519065856933593750000000000000000000000000000000
#> credit_historyvery good 0.359819308659901526681323957745917141437530517578125000000000000000000000000000000000
#> purposecar 0.376834016879434297031536971189780160784721374511718750000000000000000000000000000000
#> purposecar0 0.014960330531394868194827729723783704685047268867492675781250000000000000000000000000
#> purposeeducation 0.769240307074732343473044693382689729332923889160156250000000000000000000000000000000
#> purposefurniture/appliances 0.299856754947839199143544419712270610034465789794921875000000000000000000000000000000
#> purposerenovations 0.765024475096179989996869608148699626326560974121093750000000000000000000000000000000
#> savings_balance> 1000 DM 0.864454991925646121231352481117937713861465454101562500000000000000000000000000000000
#> savings_balance100 - 500 DM 0.270820172547056647882612878674990497529506683349609375000000000000000000000000000000
#> savings_balance500 - 1000 DM 0.156241722975992480826334940502420067787170410156250000000000000000000000000000000000
#> savings_balanceunknown 0.006336194577355417668906856221155976527370512485504150390625000000000000000000000000
#> employment_duration> 7 years 0.436737334872707494426435914647299796342849731445312500000000000000000000000000000000
#> employment_duration1 - 4 years 0.572929200325231091639466285414528101682662963867187500000000000000000000000000000000
#> employment_duration4 - 7 years 0.457781927357519768939653204142814502120018005371093750000000000000000000000000000000
#> employment_durationunemployed 0.714240538414836478864344826433807611465454101562500000000000000000000000000000000000
#> percent_of_income 0.000000000000000000000000000000000979342915146096460767000113634139779605902731418610
#> years_at_residence 0.926531642894123597109512502356665208935737609863281250000000000000000000000000000000
#> age 0.463831002770212275088113074161810800433158874511718750000000000000000000000000000000
#> other_creditnone 0.349802397489268446229004894121317192912101745605468750000000000000000000000000000000
#> other_creditstore 0.415651136680942334677979488333221524953842163085937500000000000000000000000000000000
#> housingown 0.016057663386048567910613371623185230419039726257324218750000000000000000000000000000
#> housingrent 0.022394241225180615967316199999004311393946409225463867187500000000000000000000000000
#> existing_loans_count 0.727861487334296564100100113137159496545791625976562500000000000000000000000000000000
#> jobskilled 0.000000001158131370407223614360341101825468967945198528468608856201171875000000000000
#> jobunemployed 0.000494983627366074618349200964217970977188087999820709228515625000000000000000000000
#> jobunskilled 0.000000415395249908262892673399546694668060808908194303512573242187500000000000000000
#> dependents 0.273966249915296444772394579558749683201313018798828125000000000000000000000000000000
#> phoneyes 0.000740521458109598277111396136263010703260079026222229003906250000000000000000000000
#> defaultyes 0.014013668520204447184851481722489552339538931846618652343750000000000000000000000000
-> Conclusion: From the first lm model, we can see that the adjusted R-squared are quite low (below 80%) and there are a lot of variables that don’t have significance level (indicated by the number of Pr(>|t|) that are above 0.05 - the lower are the more significant). Thus, we can made a simpler model by removing variables that has p-value > 0.05, since they don’t have significant effect toward our model.
Variable to be considered based on significance level: - months_loan_duration - percent_of_income - job - phone - default
We redo the linier regression modelling using the selected variables:
credit2 <- credit %>% select(months_loan_duration,percent_of_income,job,phone,default,amount )
data_train2 <- credit2[index, ]
data_test2 <- credit2[-index, ]
set.seed(123)
credit_lm2 <- lm(amount ~ ., data = data_train2)
summary(credit_lm2)#>
#> Call:
#> lm(formula = amount ~ ., data = data_train2)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5223.6 -1067.5 -155.5 658.1 11339.5
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3869.505 312.701 12.374 < 0.0000000000000002 ***
#> months_loan_duration 137.348 5.834 23.544 < 0.0000000000000002 ***
#> percent_of_income -816.179 60.799 -13.424 < 0.0000000000000002 ***
#> jobskilled -1619.202 208.366 -7.771 0.0000000000000242 ***
#> jobunemployed -1621.364 495.775 -3.270 0.00112 **
#> jobunskilled -1653.229 258.321 -6.400 0.0000000002659802 ***
#> phoneyes 647.825 151.359 4.280 0.0000209740597980 ***
#> defaultyes 311.598 153.555 2.029 0.04277 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1914 on 792 degrees of freedom
#> Multiple R-squared: 0.553, Adjusted R-squared: 0.549
#> F-statistic: 139.9 on 7 and 792 DF, p-value: < 0.00000000000000022
-> the comparison of the both adjusted R square of the two models are as follows: - credit_lm: Adjusted R-squared: 0.5672 - credit_lm2: Adjusted R-squared: 0.549
Conclusion: we can see from the comparison that the difference between the two models are not significant thus using credit_lm2 with simpler model is more acceptable due to with the simplicity of the model but can achieve similar result.
The performance of our model can be evaluate using RMSE (Root Mean Squared Error)
\[ RMSE = \sqrt{\frac{ \sum (\hat y - y)^2} {n}} \] We can use the RMSE () functions from caret package. Below is the first model (with complete variables) performance.
-> Performance of Model credit_lm
lm_pred <- predict(credit_lm, newdata = data_test %>% select(-amount))
# RMSE of train dataset
RMSE(pred = credit_lm$fitted.values, obs = data_train$amount)#> [1] 1832.503
#> [1] 1753.368
-> Performance of Model credit_lm2
lm_pred2 <- predict(credit_lm2, newdata = data_test2 %>% select(-amount))
# RMSE of train dataset
RMSE(pred = credit_lm2$fitted.values, obs = data_train2$amount)#> [1] 1904.685
#> [1] 1719.448
Performance Conclusion:
To determine which model is better, let’s evaluate both based on the Root Mean Squared Error (RMSE) values for both the training and testing datasets. RMSE provides an indication of the model’s prediction error, where a lower value generally indicates better performance. However, we need to look at both training and test RMSE to evaluate overfitting and generalization ability.
Here are the RMSE values:
Model: credit_lm RMSE (Train): 1832.503 RMSE (Test): 1753.368
Model: credit_lm2 RMSE (Train): 1904.685 RMSE (Test): 1719.448
Comparison and Interpretation:
Generalization Performance (Test RMSE): For credit_lm, the RMSE on the test data is 1753.368. For credit_lm2, the RMSE on the test data is 1719.448. Model credit_lm2 has a lower test RMSE compared to credit_lm, indicating it performs slightly better on unseen data.
Training Performance (Train RMSE): For credit_lm, the RMSE on the training data is 1832.503. For credit_lm2, the RMSE on the training data is 1904.685. credit_lm has a lower training RMSE, meaning it fits the training data better than credit_lm2. However, this does not necessarily mean it generalizes well to new data.
Overfitting Evaluation: The gap between train and test RMSE gives an indication of overfitting: credit_lm has a higher gap between training (1832.503) and test (1753.368) RMSE values. credit_lm2 has a smaller gap between training (1904.685) and test (1719.448) RMSE values. credit_lm2 appears to generalize better since the test performance is close to its training performance, while credit_lm has a larger gap, suggesting it may be slightly overfitting to the training data.
Conclusion: credit_lm2 is likely the better model because it has a lower test RMSE (1719.448) compared to credit_lm (1753.368), which means it performs better on new, unseen data. Additionally, the smaller gap between training and test RMSE for credit_lm2 indicates it generalizes better and is less prone to overfitting than credit_lm.
Thus, based on the RMSE values, credit_lm2 is a better model overall due to its lower test error and better generalization capability.
resact <- data.frame(residual = credit_lm2$residuals, fitted = credit_lm2$fitted.values)
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank())
Residuals are around value of 0.
Conclusion: Due to the residuals are around value of 0, thus linearity assumption is fulfilled.
Status: Linearity assumption
[Fulfilled].
Additional test for linearity
Linear regression model is expected to have error that are
normally distributed. the above chart displayed that the error
are focussed on value of 0 thus linearity assumption are fulfilled.
#>
#> Shapiro-Wilk normality test
#>
#> data: credit_lm2$residuals
#> W = 0.89477, p-value < 0.00000000000000022
Conclusion: the null hypothesis is that the residuals follow normal distribution. With p-value < 0.05, we can conclude that our hypothesis is rejected, and our residuals are not following the normal distribution.
Status: Normality assumption
[Not Fulfilled].
#> lag Autocorrelation D-W Statistic p-value
#> 1 0.001180496 1.997127 0.912
#> Alternative hypothesis: rho != 0
Conclusion: The result shows that the null hypothesis is accepted, meaning that our residual has no autocorrelation.
Status: No autocorrelation assumption
[Fulfilled].
#>
#> studentized Breusch-Pagan test
#>
#> data: credit_lm2
#> BP = 138.57, df = 7, p-value < 0.00000000000000022
resact %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0))
Conclusion: We can observe that on lower fitted values,
the residuals are concentrated around the value of 0. As the fitted
value increases, the residuals are also got bigger. Second way to detect
heterocesdasticity is using the Breusch-Pagan test, with null hypothesis
is there is no heterocesdasticity. With p-value < 0.05, we can
conclude that heterocesdasticity is present in our model.
Status: Homoscedasticity of Residuals assumption
[Not Fulfilled].
#> GVIF Df GVIF^(1/(2*Df))
#> months_loan_duration 1.112548 1 1.054774
#> percent_of_income 1.027793 1 1.013801
#> job 1.255754 3 1.038686
#> phone 1.210974 1 1.100443
#> default 1.054460 1 1.026869
Conclusion:Multicollinearity mean that there is a correlation between the independent variables/predictors. To check the multicollinearity, we can measure the varianec inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity. Thus no multicollinearity detected.
Status: No Multicollinearity assumption
[Fulfilled].
Final Conclusion of Assumption From 5 assumption
tests, it is concluded that: 1. Linearity -> [Fulfilled]
2. Normality of Residuals -> [Not Fulfilled] 3. No
Autocorrelation -> [Fulfilled] 4. Homoscedasticity of
Residuals -> [Not Fulfilled] 5. No Multicollinearity
-> [Fulfilled]
Based on assumption test, credit_lm2 model is able to fulfill 3 from 5 total test which is linearity, no autocorrelation and no multicollinearity. But from 5 assumption, the two that is mandatory to be fulfilled are linearity and no multicollinearity. Thus the model is usable.
Variables that are useful to describe the variances in credit amount are months_loan_duration, percent_of_income, job, phone, default. Our final model has satisfied the classical assumptions. The R-squared of the model is sufficient, with 54.9% of the variables can explain the variances in the credit amount. The accuracy of the model in predicting the car price is measured with RMSE, in which that our model may slightly overfit the traning dataset.
We have already learn how to build a linear regression model and what need to be concerned when building the model.