In the competitive landscape of home financing, understanding and predicting loan eligibility is crucial for both lenders and borrowers. For a home loan company, implementing a robust loan eligibility prediction system can significantly enhance the efficiency of the loan approval process, improve customer satisfaction, and streamline operations.
In the realm of real estate, securing a home loan represents a significant milestone for many individuals and families. As one of the most substantial financial commitments people undertake, the process of determining eligibility for a home loan is both crucial and complex. Home loan eligibility prediction is a key aspect of modern lending practices, utilizing advanced technologies and data analysis to streamline and enhance the decision-making process for both lenders and borrowers.
What is Loan Eligibility Prediction? Loan eligibility prediction refers to the process of assessing whether a prospective borrower meets the necessary criteria to qualify for a home loan. This assessment is typically based on various factors, including credit history, income, employment status, and other applicant attributes. By predicting eligibility, home loan companies can make more informed decisions, minimize risk, and provide better guidance to potential borrowers.
• Identify and analyze which variables are most significant in predicting the loan approval status.
• Detect and examine influential points within the data-set.
• Develop and fit a predictive model to the dataset.
• Evaluate the accuracy of the fitted model.
• Apply the developed model to a new data-set with unknown loan approval statuses to predict loan approval status for these new applications.
Dream Housing Finance company deals in all home loans. They have a presence across all urban, semi-urban, and rural areas. The customer first applies for a home loan, after which the company validates the customer's eligibility for a loan.
The company wants to automate the loan eligibility process (in real time) based on the customer details provided while filling in the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the task is to identify the customer segments that are eligible for a loan amount, so that the company can specifically target these customers. For this purpose, a partial data set has been provided.
•Loan_ID : Unique Loan ID
•Gender : Male / Female
•Married : Applicant married Yes or No (Y/N)
•Dependents : Number of dependents
“0” represents individuals with no dependents.
“1” represents individuals with one dependent.
“2” represents individuals with two dependents.
“3+” represents individuals with three or more dependents.
•Education : Applicant Education (Graduate / Not Graduate)
•Self_Employed : Self-employed (Y/N)
•ApplicantIncome : Applicant income
•CoapplicantIncome : It refers to the income reported by a co-applicant who is applying alongside the primary applicant. A co-applicant might be a spouse, partner, or another individual who will share responsibility for repaying a loan or fulfilling other financial obligations.
•LoanAmount : Loan amount in thousands
•Loan_Amount_Term : Term of a loan in months. This is the duration over which the borrower agrees to repay the loan, usually expressed in months.
•Credit_History : If the credit history meets the guidelines, then the value is 1; otherwise, the value is 0.
•Property_Area : Urban/ Semi-Urban/ Rural
•Loan_Status : Loan approved (Y/N)
Train Data (the data-set used to train or fit the model)
• The original data-set has 614 rows and 13 columns.
Loan_ID Gender Married Dependents
0 0 0 0
Education Self_Employed ApplicantIncome CoapplicantIncome
0 0 0 0
LoanAmount Loan_Amount_Term Credit_History Property_Area
22 14 50 0
Loan_Status
0
• We have 86 NA (not available) values in the data-set: 22 in LoanAmount, 14 in Loan_Amount_Term, and 50 in Credit_History. We replaced the NA values in LoanAmount and Loan_Amount_Term with the means of their available values. For the remaining NA values (in Credit_History), we removed the rows containing them.
• There are 56 empty cells, so we removed the rows containing these empty cells.
• After removing the rows with NA values and empty cells, we removed the first column, “Loan_ID”.
• Converting Character Variables to Factors: We have 8 categorical variables (Gender, Married, etc.) that we converted to factor variables. This operation does not change the dimensions of the data-set.
• Final Data-set Dimensions: After these cleaning steps, the data-set has 511 rows and 12 columns (see the sketch below).
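The cleaning steps above can be sketched in R as follows, assuming the raw training data has been read into a data frame `lo` (the full script is in the appendix; the mechanics there differ slightly but the result is the same):
lo$LoanAmount[is.na(lo$LoanAmount)]=mean(lo$LoanAmount,na.rm=TRUE) # mean imputation
lo$Loan_Amount_Term[is.na(lo$Loan_Amount_Term)]=mean(lo$Loan_Amount_Term,na.rm=TRUE)
loan=na.omit(lo) # drop rows with NA in Credit_History
loan=loan[rowSums(loan=="")==0,] # drop rows containing empty cells
loan=loan[,-1] # drop the Loan_ID column
for(v in c("Gender","Married","Dependents","Education","Self_Employed","Credit_History","Property_Area","Loan_Status")){
  loan[[v]]=as.factor(loan[[v]]) # convert categorical variables to factors
}
dim(loan) # 511 rows, 12 columns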
[1] "Empty Cells"
[1] 56
'data.frame': 511 obs. of 12 variables:
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ Married : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 2 2 ...
$ Dependents : Factor w/ 4 levels "0","1","2","3+": 1 2 1 1 1 3 1 4 3 2 ...
$ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
$ Self_Employed : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 1 ...
$ ApplicantIncome : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
$ CoapplicantIncome: num 0 1508 0 2358 0 ...
$ LoanAmount : num 146 128 66 120 141 ...
$ Loan_Amount_Term : num 360 360 360 360 360 360 360 360 360 360 ...
$ Credit_History : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
$ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
$ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
Interpretation
• The donut chart represents the gender of the applicants, divided into two categories: “Female” and “Male”. The chart indicates that 18% of the applicants are “Female” and 82% are “Male”.
• The donut chart represents the marital status of the applicants, divided into two categories: “Yes” and “No”. The chart indicates that 65% of the applicants are married (Yes), and 35% are not married (No).
Interpretation
• The bar chart shows the distribution of the “Dependents” variable across its categories: 294 individuals have no dependents, 85 have one dependent, 88 have two dependents, and 44 have three or more dependents. The majority of individuals in the dataset have no dependents (294), fewer have one or two dependents (85 and 88, respectively), and the smallest group consists of individuals with three or more dependents (44).
• The donut chart represents the graduation status of the applicants, divided into two categories: “Graduate” and “Not Graduate”. The chart indicates that 78% of the applicants are “Graduate” and 22% are “Not Graduate”.
Interpretation
• This is a density plot overlaid on a histogram, showing the distribution of Applicant Income. The distribution is highly right-skewed, indicating that most applicants have lower incomes, with fewer applicants having very high incomes. The majority of the data lies within the range of 0 to 20,000, as evidenced by the height of the bars and the peak of the density curve. There are a few applicants with much higher incomes, up to around 80,000.
• This is a density plot overlaid on a histogram, showing the distribution of Coapplicant Income. The distribution is heavily right-skewed, meaning that most coapplicant incomes are concentrated at the lower end, with a long tail extending to the right. A large number of coapplicants have very low or no income.
Interpretation
• This is a density plot overlaid on a histogram, showing the distribution of loan amounts. Most loan amounts fall between approximately 100 and 200 (in thousands). The peak of the density curve indicates the most common loan amount range, around 150 to 170.
• This is a density plot overlaid on a histogram, showing the distribution of Loan_Amount_Term. The histogram indicates that the majority of loan terms are clustered around 360 months (30 years). The density curve peaks sharply around the 360 mark, and there is a smaller peak around 180 months (15 years).
Interpretation
• The pie chart shows the distribution of self-employment status. It is divided into two segments, with 14% representing “Yes” (self-employed) and 86% representing “No” (not self-employed). This shows that the majority of applicants are not self-employed, while a smaller portion are self-employed.
• For Credit_History, category 1 (credit history meets the guidelines) has 431 instances, while category 0 has significantly fewer (80).
Interpretation
• The bar chart shows the distribution of values across three property areas: Rural, Semiurban, and Urban. The Semiurban category has the highest count (197), the Urban area has 165, and the Rural area has the lowest count (149).
• The pie chart shows the distribution of loan statuses in terms of approval (Yes) and rejection (No). The chart is divided into two segments, with 68% representing “Yes” (loans that were approved) and 32% representing “No” (loans that were not approved). This suggests that more than two-thirds of the loan applications were successful, while about one-third were not.
Gender Married Dependents Education Self_Employed
Female: 91 No :180 0 :294 Graduate :401 No :441
Male :420 Yes:331 1 : 85 Not Graduate:110 Yes: 70
2 : 88
3+: 44
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
Min. : 150 Min. : 0 Min. : 9.0 Min. : 36
1st Qu.: 2886 1st Qu.: 0 1st Qu.:100.0 1st Qu.:360
Median : 3858 Median : 1086 Median :129.0 Median :360
Mean : 5308 Mean : 1562 Mean :144.2 Mean :342
3rd Qu.: 5820 3rd Qu.: 2254 3rd Qu.:165.5 3rd Qu.:360
Max. :81000 Max. :33837 Max. :600.0 Max. :480
Credit_History Property_Area Loan_Status
0: 80 Rural :149 N:164
1:431 Semiurban:197 Y:347
Urban :165
Since our response variable is binary, taking only the values 0 and 1, we use a logistic regression model.
The logistic regression model is
\(\log\frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x\).
Solving for \(\pi(x)\), this gives
\(\pi(x) = \frac{e^{\alpha+\beta x}}{1 + e^{\alpha+\beta x}}\).
If we want class prediction, we should predict Y = 1 when \(\pi(x) \ge 0.5\) and Y = 0 when \(\pi(x) < 0.5\). This means guessing 1 whenever \(\alpha + \beta x\) is non-negative, and 0 otherwise.
If we have \(p\) covariates \(x_1, x_2, \ldots, x_p\) and a single response variable \(y\), then we can fit the logistic model
\(\pi(x) = \frac{e^{\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_px_p}}{1 + e^{\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_px_p}}\).
Here \(\pi(x)\) represents the probability \(\pi(x) = P(Y = 1 \mid x_1, x_2, \ldots, x_p)\). The variables \(x_1, x_2, \ldots, x_p\), on the other hand, can be either quantitative or qualitative. The parameter \(\beta_i\), \(i = 1, 2, \ldots, p\), signifies the effect of \(x_i\) on the probability \(\pi(x)\). The interpretation of the intercept term \(\beta_0\) remains exactly the same as in the case of a single covariate.
If any of our covariates is a factor (or categorical variable), then we write the model in the same way as we write in case of linear models: including dummy variables. More specifically if we have a factor covariate A with k levels \(A_1\), \(A_2\), …, \(A_k\) having potential effect on the response y, then we use \(k−1\) indicator variables or dummy variables \(x_1\),\(x_2\), …, \(x_{k−1}\) where \[x_i =\begin{cases} 1 & \text{if the observation receives the $i^{th}$ level $A_i$}\\ 0 & \text{otherwise} \end{cases}\]
Here the \(k^{th}\) level \(A_k\) is the baseline or reference level against which we measure the effects of the other levels. We can take any one of the k levels as the baseline and compare the other levels with it. So if one of the \(p\) covariates is such a factor, we finally have (p−1)+(k−1) covariates in our model.
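As an illustration, the following minimal sketch (toy data, not the project data-set) shows how glm() builds these dummy variables automatically when a covariate is supplied as a factor:
set.seed(1)
toy=data.frame(y=factor(sample(c("N","Y"),20,replace=TRUE)),
               x=rnorm(20),
               area=factor(sample(c("Rural","Semiurban","Urban"),20,replace=TRUE)))
fit=glm(y~x+area,family="binomial",data=toy)
head(model.matrix(fit)) # "Rural" is the baseline; two dummy columns are created for area
coef(fit) # coefficients for x, areaSemiurban and areaUrban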
Call: glm(formula = Loan_Status ~ Property_Area + Credit_History +
Loan_Amount_Term + LoanAmount + CoapplicantIncome + ApplicantIncome +
Self_Employed + Education + Dependents + Married + Gender,
family = "binomial", data = loan)
Coefficients:
(Intercept) Property_AreaSemiurban Property_AreaUrban
-2.636e+00 9.355e-01 5.715e-02
Credit_History1 Loan_Amount_Term LoanAmount
3.732e+00 -2.280e-04 -2.614e-03
CoapplicantIncome ApplicantIncome Self_EmployedYes
-4.823e-05 -1.295e-06 -5.559e-02
EducationNot Graduate Dependents1 Dependents2
-4.493e-01 -2.744e-01 2.539e-01
Dependents3+ MarriedYes GenderMale
7.942e-02 5.463e-01 2.504e-01
Degrees of Freedom: 510 Total (i.e. Null); 496 Residual
Null Deviance: 641.4
Residual Deviance: 465 AIC: 495
• Our model includes 11 covariates, 7 of which are factor variables. We introduce dummy variables for each of these factor variables. As a result, the total number of covariates in the model increases to 14.
An influential point is a data point that unduly influences the output of the regression analysis. A point is considered influential if its exclusion causes major changes in the fitted regression function. Depending on its location, it may affect all statistics, including the p-values, R-squared, coefficients, and intercept.
A measure for detecting influential points is Cook’s distance:
\(CD_i = \frac{h_{ii}}{1-h_{ii}}\left(\frac{\left(e_i^{(s)}\right)^2}{p+1}\right)\)
If \(CD_i > \frac{4}{n-p-1}\), then the \(i^{th}\) data point can be designated as an influential point.
n : total number of observations
p : number of variables
\(h_{ii}\) : \(h_{ii}\) is the leverage, i.e., the \(i^{th}\) diagonal element of the hat matrix \(X(X^TX)^{−1}X^T\) , X is the matrix of covariates.
\(e_i^{(s)} = \frac{e_i}{s_{(i)}\sqrt{1-h_{ii}}}\), where \(s_{(i)}^2 = \frac{RSS_{(i)}}{n-p-2}\) and \(RSS_{(i)}\) is the residual sum of squares obtained by running the least squares method on the data-set excluding the \(i^{th}\) point.
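A minimal sketch of this check in R, assuming `model` is the logistic regression fit on the cleaned training data (n = 511 observations, 12 variables), mirroring the appendix code:
cd=cooks.distance(model) # Cook's distance for every observation
cutoff=4/(511-12-1) # rule-of-thumb threshold 4/(n-p-1)
influential=which(cd>cutoff) # indices of influential points
length(influential) # number of influential points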
Interpretation
The plot shows Cook’s distance values on the y-axis for each observation. Larger Cook’s distance values indicate observations that have a greater influence on the model’s coefficients. In this plot, observations 156, 184, and 582 stand out with noticeably higher Cook’s distance values than the others, suggesting these points are influential.
[1] 41
There are 41 influential points.
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Call:
glm(formula = Loan_Status ~ Property_Area + Credit_History +
Loan_Amount_Term + LoanAmount + CoapplicantIncome + ApplicantIncome +
Self_Employed + Education + Dependents + Married + Gender,
family = "binomial", data = loan)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.517e+01 2.144e+03 -0.016 0.986917
Property_AreaSemiurban 1.608e+00 4.401e-01 3.655 0.000257 ***
Property_AreaUrban 2.065e-01 3.641e-01 0.567 0.570657
Credit_History1 3.851e+01 2.144e+03 0.018 0.985673
Loan_Amount_Term -6.100e-03 4.309e-03 -1.416 0.156916
LoanAmount -7.821e-03 3.435e-03 -2.277 0.022780 *
CoapplicantIncome 2.639e-04 1.335e-04 1.976 0.048151 *
ApplicantIncome 6.891e-05 8.427e-05 0.818 0.413564
Self_EmployedYes 2.336e+00 1.061e+00 2.202 0.027684 *
EducationNot Graduate -4.188e-01 3.940e-01 -1.063 0.287811
Dependents1 4.050e-01 4.946e-01 0.819 0.412870
Dependents2 6.917e-01 5.152e-01 1.343 0.179367
Dependents3+ 1.764e+01 1.484e+03 0.012 0.990513
MarriedYes 4.618e-01 3.753e-01 1.230 0.218534
GenderMale -6.978e-02 4.294e-01 -0.163 0.870899
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 554.33 on 469 degrees of freedom
Residual deviance: 269.67 on 455 degrees of freedom
AIC: 299.67
Number of Fisher Scoring iterations: 19
Interpretation
At the 5% level of significance, we can conclude that 4 covariates are statistically significant i.e. their p-values are less than 0.05.
• When the other variables remain fixed, for every one-unit increase in LoanAmount, the log odds of loan approval decrease by 7.821e-03.
• When the other variables remain fixed, for every one-unit increase in CoapplicantIncome, the log odds of loan approval increase by 2.639e-04.
• Compared to having a property in a rural area, having a property in a semiurban area increases the log odds of loan approval by 1.608.
• Compared to a non-self-employed applicant, being self-employed increases the log odds of loan approval by 2.336.
To improve the fit of our model, we use backward stepwise selection based on AIC.
Backward stepwise regression removes predictors one at a time. The process starts with all predictors in the model; at each step, the AIC is computed for every model obtained by dropping a single predictor, and the model with the lowest AIC is selected, as it represents a good balance between model fit and simplicity (fewer parameters). The procedure stops when no further removal lowers the AIC.
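A minimal sketch of this step in R, assuming `model2` is the full logistic model fitted after removing the influential points (as in the appendix code):
model3=step(model2,direction="backward") # drops one term at a time while the AIC improves
formula(model3) # covariates retained in the selected model
AIC(model3) # AIC of the selected model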
Start: AIC=299.67
Loan_Status ~ Property_Area + Credit_History + Loan_Amount_Term +
LoanAmount + CoapplicantIncome + ApplicantIncome + Self_Employed +
Education + Dependents + Married + Gender
Df Deviance AIC
- Gender 1 269.70 297.70
- ApplicantIncome 1 270.45 298.45
- Education 1 270.77 298.77
- Married 1 271.19 299.19
<none> 269.67 299.67
- Loan_Amount_Term 1 272.14 300.14
- CoapplicantIncome 1 274.16 302.16
- LoanAmount 1 275.66 303.66
- Dependents 3 280.85 304.85
- Self_Employed 1 278.66 306.66
- Property_Area 2 287.54 313.54
- Credit_History 1 519.50 547.50
Step: AIC=297.7
Loan_Status ~ Property_Area + Credit_History + Loan_Amount_Term +
LoanAmount + CoapplicantIncome + ApplicantIncome + Self_Employed +
Education + Dependents + Married
Df Deviance AIC
- ApplicantIncome 1 270.47 296.47
- Education 1 270.83 296.83
- Married 1 271.23 297.23
<none> 269.70 297.70
- Loan_Amount_Term 1 272.14 298.14
- CoapplicantIncome 1 274.22 300.22
- LoanAmount 1 275.74 301.74
- Dependents 3 280.86 302.86
- Self_Employed 1 278.72 304.72
- Property_Area 2 287.96 311.96
- Credit_History 1 519.86 545.86
Step: AIC=296.47
Loan_Status ~ Property_Area + Credit_History + Loan_Amount_Term +
LoanAmount + CoapplicantIncome + Self_Employed + Education +
Dependents + Married
Df Deviance AIC
- Married 1 271.83 295.83
- Education 1 271.91 295.91
<none> 270.47 296.47
- Loan_Amount_Term 1 273.47 297.47
- CoapplicantIncome 1 274.23 298.23
- LoanAmount 1 276.61 300.61
- Dependents 3 281.78 301.78
- Self_Employed 1 281.28 305.28
- Property_Area 2 288.50 310.50
- Credit_History 1 519.86 543.86
Step: AIC=295.83
Loan_Status ~ Property_Area + Credit_History + Loan_Amount_Term +
LoanAmount + CoapplicantIncome + Self_Employed + Education +
Dependents
Df Deviance AIC
- Education 1 273.31 295.31
<none> 271.83 295.83
- Loan_Amount_Term 1 275.59 297.59
- CoapplicantIncome 1 276.79 298.79
- LoanAmount 1 277.41 299.41
- Self_Employed 1 282.79 304.79
- Dependents 3 287.06 305.06
- Property_Area 2 290.75 310.75
- Credit_History 1 522.55 544.55
Step: AIC=295.31
Loan_Status ~ Property_Area + Credit_History + Loan_Amount_Term +
LoanAmount + CoapplicantIncome + Self_Employed + Dependents
Df Deviance AIC
<none> 273.31 295.31
- Loan_Amount_Term 1 276.53 296.53
- LoanAmount 1 277.99 297.99
- CoapplicantIncome 1 278.65 298.65
- Self_Employed 1 284.02 304.02
- Dependents 3 288.28 304.28
- Property_Area 2 293.69 311.69
- Credit_History 1 526.74 546.74
Interpretation
We selected a model with 7 covariates: Loan_Amount_Term, LoanAmount, CoapplicantIncome, Self_Employed, Dependents, Property_Area, and Credit_History. Its AIC of 295.31 is the lowest among the candidate models, indicating that this model is relatively efficient.
Interpretation
The residuals above 0 are tightly packed, indicating that the model predicts those values reasonably well. However, there is a noticeable spread of residuals below 0, particularly in the range between -1 and -4, suggesting that the model is under-predicting for those specific cases. There are some points that lie far from the rest of the data, particularly those below -2. These points might be considered outliers and could indicate instances where the model performs poorly. The fact that most residuals are centered around 0 and are relatively evenly distributed suggests that the model has captured most of the variance in the data reasonably well.
A half-normal probability plot of the deviance residuals with a simulated envelope is useful both for examining the adequacy of the linear part of the logistic regression model and for identifying deviance residuals that are outlying. A half-normal probability plot helps to highlight outlying deviance residuals even though the residuals are not normally distributed. In a normal probability plot, the kth ordered residual is plotted against the percentile z [(k − 0.375) / (n + 0.25)] or against √MSE times this percentile. In a half-normal probability plot, the kth ordered absolute residual is plotted against:
\(z\left(\frac{k + n - 1/8}{2n + 1/2}\right)\)
Outliers will appear at the top right of a half-normal probability plot as points separated from the others. However, a half-normal plot of the absolute residuals will not necessarily give a straight line even when the fitted model is in fact correct. To identify outlying deviance residuals, we combine a half-normal probability plot with a simulated envelope. This envelope constitutes a band such that the plotted residuals are all likely to fall within the band if the fitted model is correct.
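A minimal sketch, assuming the selected model `model3` and the hnp package used in the appendix:
library("hnp")
hnp(model3) # half-normal plot of residuals with a simulated envelope (99 simulated response vectors by default)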
Binomial model
Interpretation
The plot includes confidence bands around the line (shown as dotted lines). These bands represent the expected range of variation under the fitted model. Most of the residuals fall within these confidence bands, suggesting that the model is capturing the majority of the variability in the data accurately. There are some deviations from the line, particularly at the higher quantiles (above approximately 2.0 on the x-axis), where the residuals tend to rise above the line. This deviation might indicate that the model slightly underestimates the higher values or that there could be some issues with the tails of the distribution that are not being captured well by the model.
The confusion matrix is a useful tool in classification problems to evaluate the performance of a classification model. It provides a summary of prediction results by comparing the predicted classifications against the actual values. In a binary classification problem, the confusion matrix is typically a 2x2 matrix with the following entries: True Positives (TP), the number of instances correctly predicted as positive; True Negatives (TN), the number of instances correctly predicted as negative; False Positives (FP), the number of instances incorrectly predicted as positive; and False Negatives (FN), the number of instances incorrectly predicted as negative. The metrics defined below are computed from these four counts (a computational sketch follows the definitions).
• Accuracy – It determines the overall predicted accuracy of the model. It is calculated as
Accuracy = \(\frac{True \, Positives+True \, Negatives}{True\, Positives+True\, Negatives+False\, Positives+False\, Negatives}\)
• True Positive Rate (TPR) – It indicates how many positive values,out of all the positive values, have been correctly predicted. The formula to calculate the true positive rate is
TPR = \(\frac{True\,Positives}{True\, Positives+False\, Negatives}\)
It is also known as Sensitivity or Recall.
• False Positive Rate (FPR) – It indicates how many negative values, out of all the negative values, have been incorrectly predicted. The formula to calculate the false positive rate is
FPR = \(\frac{False\,Positives}{True\, Negatives+False\, Positives}\)
• True Negative Rate (TNR) – It indicates how many negative values,out of all the negative values, have been correctly predicted. The formula to calculate the true negative rate is
TNR = \(\frac{True\, Negatives}{True\, Negatives+False\, Positives}\)
It is also known as Specificity.
False Negative Rate (FNR) – It indicates how many positive values,out of all the positive values, have been incorrectly predicted. The formula to calculate false negative rate is
FNR = \(\frac{False\, Negatives}{True\, Positives+False\, Negatives}\)
• Precision: It indicates how many values, out of all the predicted positive values, are actually positive. It is formulated as:
Precision = \(\frac{True\, Positives}{True\, Positives+False\, Positives}\)
• F Score: The F score is the harmonic mean of precision and recall. It lies between 0 and 1; the higher the value, the better the model. It is formulated as
F = \(\frac{2 \cdot (precision \cdot recall)}{precision + recall}\).
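The following sketch computes these metrics in R from the 2x2 table reported below for the training data (actual classes in rows, predicted classes in columns):
tab=matrix(c(74,56,0,340),nrow=2,byrow=TRUE,
           dimnames=list(Actual=c("0","1"),Predicted=c("0","1")))
TN=tab["0","0"]; FP=tab["0","1"]; FN=tab["1","0"]; TP=tab["1","1"]
(TP+TN)/sum(tab) # accuracy, about 0.881
recall=TP/(TP+FN) # true positive rate (recall), 1
FP/(TN+FP) # false positive rate, about 0.431
TN/(TN+FP) # true negative rate (specificity), about 0.569
FN/(TP+FN) # false negative rate, 0
precision=TP/(TP+FP) # precision, about 0.859
2*precision*recall/(precision+recall) # F score, about 0.924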
Confusion Matrix
Predicted
Actual 0 1
0 74 56
1 0 340
Accuracy-
[1] 0.8808511
The model is accurate, correctly classifying approximately 88.09% of the total instances.
True Positive Rate (TPR) –
[1] 1
The model identifies all actual positive cases correctly, with no false negatives. This means that every positive instance is correctly classified as positive.
False Positive Rate (FPR) -
[1] 0.4307692
About 43.08% of the actual negative cases are incorrectly classified as positive.
True Negative Rate (TNR) –
[1] 0.5692308
The model correctly identifies around 56.92% of the actual negative cases.
False Negative Rate (FNR) –
[1] 0
There are no false negatives, meaning the model does not miss any actual positive cases.
Precision -
[1] 0.8585859
About 85.86% of the instances predicted as positive are actually positive.
F Score -
[1] 0.923913
An F score of about 0.92, close to 1, indicates that the model handles the positive class well.
Area Under the Curve (AUC): The AUC quantifies the overall ability of the model to discriminate between positive and negative classes. AUC ranges from 0 to 1, where a value of 0.5 indicates no discriminative power (similar to random guessing), and a value of 1 indicates perfect discrimination.
The ROC curve shows the accuracy of a classification model across threshold values, and the model’s performance is summarized by the Area Under the Curve (AUC), also referred to as the index of accuracy (A) or concordance index. The higher the area, the better the model. The ROC curve is plotted with the True Positive Rate on the Y axis and the False Positive Rate on the X axis.
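A minimal sketch of the ROC curve and AUC computation with the ROCR package, assuming the selected model `model3` and the cleaned data frame `loan`:
library("ROCR")
prob=predict(model3,type="response",newdata=loan) # fitted probabilities
pred=prediction(prob,loan$Loan_Status) # ROCR prediction object
perf=performance(pred,"tpr","fpr") # TPR versus FPR
plot(perf)
abline(a=0,b=1) # reference line corresponding to random guessing
performance(pred,"auc")@y.values[[1]] # area under the curve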
Interpretation
The curve is close to the upper left corner, which indicates a good model, as it maximizes the TPR while minimizing the FPR. The AUC value is 0.90108, which is a high score. This suggests that the model has a very good ability to distinguish between the positive and negative classes. A perfect model would have an AUC of 1. The ROC curve and the AUC value indicate that the model has good predictive performance.
First, we create a feature matrix in which the categorical features are converted to numeric with one-hot encoding (this transforms categorical data into a binary matrix representation where each category is represented by a binary column). We drop the first dummy level of each factor (the reference level) to avoid multicollinearity.
We then combine the dummies with the numeric variables and convert them to a matrix X. The outcome variable is put into a vector y.
Finally, we create a group vector that distinguishes groups of predictors in the data: dummies coming from the same categorical variable are assigned to the same group. This step is necessary for the group LASSO, as it allows the penalty to be applied to whole groups of predictors. A sketch of this preparation is given below.
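A minimal sketch of this preparation, assuming the cleaned data frame `loan` and the fastDummies and dplyr packages used in the appendix (here the response Loan_Status is kept out of the predictor matrix):
library("fastDummies")
library("dplyr")
fac=select_if(loan,is.factor) # categorical variables
fac=fac[,setdiff(colnames(fac),"Loan_Status")] # predictors only
dums=dummy_cols(fac,remove_first_dummy=TRUE,remove_selected_columns=TRUE) # k-1 dummies per factor
num=select_if(loan,is.numeric) # numeric predictors
X=as.matrix(cbind(num,dums)) # predictor matrix
y=ifelse(loan$Loan_Status=="Y",1,0) # binary response
grp=c(colnames(num),sub("_[^_]+$","",colnames(dums))) # dummies from one factor share a group label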
The acronym “LASSO” stands for Least Absolute Shrinkage and Selection Operator. This particular type of regression is well-suited for models showing high levels of multicollinearity. It performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model.
Group lasso is a variable selection method that extends the lasso to select variables in groups. In situations where features are naturally grouped (e.g., polynomial features, categorical variables converted into dummy variables, or genetic data grouped by genes), it might be desirable to either select or discard an entire group of variables together. Mathematical Formulation: The model for the group lasso is given below; a short fitting sketch follows the notation.
\(\underset{\beta}{\min}\left(\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p}X_{ij}\beta_j\right)^2 + \lambda \sum_{g=1}^{G}\sqrt{p_g}\,\|\beta_g\|_2\right)\)
where:
• n is the number of observations.
• \(y_i\) is the response variable for the i-th observation.
• \(X_{ij}\) is the value of the j-th predictor for the i-th observation.
• \(β_j\) is the coefficient for the j-th predictor.
• λ is the regularization parameter that controls the strength of the penalty.
• \(β_g\) is the vector of coefficients for the g-th group of predictors.
• \(\|\beta_g\|_2\) represents the Euclidean (L2) norm of the coefficients in the g-th group.
• \(p_g\) is the number of predictors in the g-th group.
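A minimal sketch of fitting the group lasso path with the grpreg package, using the X, y, and grp objects from the preparation sketch above:
library("grpreg")
fit=grpreg(X,y,group=grp,penalty="grLasso",family="binomial") # full regularization path over a grid of lambda values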
Logistic regression modeling Pr(y=Male)
Cross validation is a technique used in machine learning to evaluate the performance of a model on unseen data. It involves dividing the available data into multiple folds or subsets, using one of these folds as a validation set, and training the model on the remaining folds. This process is repeated multiple times, each time using a different fold as the validation set. Finally, the results from each validation step are averaged to produce a more robust estimate of the model’s performance.
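A minimal sketch of the cross-validation step with cv.grpreg (10 folds by default), continuing from the previous sketch:
cvfit=cv.grpreg(X,y,group=grp,penalty="grLasso",family="binomial") # k-fold CV over the lambda path
plot(cvfit) # CV error as a function of lambda
cvfit$lambda.min # lambda with the smallest CV error
coef(fit,lambda=cvfit$lambda.min) # coefficients at that lambda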
Logistic regression modeling Pr(y=Male)
grLasso-penalized logistic regression with n=470, p=14
At minimum cross-validation error (lambda=0.0112):
-------------------------------------------------
Nonzero coefficients: 10
Nonzero groups: 7
Cross-validation error of 0.78
Maximum R-squared: 0.14
Maximum signal-to-noise ratio: 0.19
Prediction error at lambda.min: 0.183
Minimum lambda value (the value of lambda that gives the lowest mean cross-validated error):
[1] 0.01115884
(Intercept) ApplicantIncome CoapplicantIncome
1.2882341569 0.0000000000 0.0002796523
LoanAmount Loan_Amount_Term Married_Yes
0.0000000000 -0.0021408138 1.4390024087
Dependents_1 Dependents_2 Dependents_3+
-0.1435232569 0.7378090264 0.5760988879
Education_Not Graduate Self_Employed_Yes Credit_History_1
0.0455341745 0.0000000000 0.0093371183
Property_Area_Semiurban Property_Area_Urban Loan_Status_Y
-0.3770140832 -0.0613556516 0.0000000000
We can see that there are 7 effective covariates: CoapplicantIncome, Loan_Amount_Term, Married, Dependents, Education, Credit_History, and Property_Area.
We fit a logistic regression model using these 7 covariates.
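A minimal sketch of this refit, assuming the cleaned data frame `loan` (as in the appendix code):
model4=glm(Loan_Status~Property_Area+Credit_History+Loan_Amount_Term+CoapplicantIncome+Education+Dependents+Married,family="binomial",data=loan)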
Confusion Matrix
Predicted
Actual 0 1
0 73 57
1 0 340
Receiver Operating Characteristic (ROC) Curve and Area Under the Curve
The model we fit using backward selection by AIC included 7 covariates: CoapplicantIncome, LoanAmount, Self_Employed, Loan_Amount_Term, Dependents, Credit_History, and Property_Area. We checked the model’s prediction performance and found an AUC of 0.9010. Using the group lasso method, we also obtained a model with 7 covariates, but this model included ‘Married’ and ‘Education’ instead of ‘LoanAmount’ and ‘Self_Employed’; its AUC was 0.8898. Although the second model also has a good AUC, we decided to use the first model for predicting loan status.
The test data-set has 367 rows and 12 columns.
Loan_ID Gender Married Dependents
0 0 0 0
Education Self_Employed ApplicantIncome CoapplicantIncome
0 0 0 0
LoanAmount Loan_Amount_Term Credit_History Property_Area
5 6 29 0
• We have 40 NA (not available) values in the data-set: 29 in Credit_History, 5 in LoanAmount, and 6 in Loan_Amount_Term. As with the training data, we replaced the NA values in LoanAmount and Loan_Amount_Term with the means of their available values, and removed the rows with NA in Credit_History.
• There are 41 empty cells, so we removed the rows containing these empty cells.
• Converting Character Variables to Factors: We have 7 categorical variables (Gender, Married, etc.) that we converted to factor variables. This operation does not change the dimensions of the data-set.
• Final Data-set Dimensions: After the cleaning steps and dropping Loan_ID, the data-set used for prediction has 298 rows and 11 columns.
[1] "Empty Cells"
[1] 41
'data.frame': 298 obs. of 11 variables:
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 2 2 2 1 ...
$ Married : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 1 1 1 ...
$ Dependents : Factor w/ 4 levels "0","1","2","3+": 1 2 3 1 1 2 3 1 1 1 ...
$ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 2 2 2 2 2 1 ...
$ Self_Employed : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
$ ApplicantIncome : int 5720 3076 5000 3276 2165 2226 3881 2400 3091 4666 ...
$ CoapplicantIncome: int 0 1500 1800 0 3422 0 0 2400 0 0 ...
$ LoanAmount : num 110 126 208 78 152 59 147 123 90 124 ...
$ Loan_Amount_Term : num 360 360 360 360 360 360 360 360 360 360 ...
$ Credit_History : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
$ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 3 2 1 2 3 2 ...
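A minimal sketch of generating the predictions, assuming the selected model `model3`, the cleaned test data `te` (without Loan_ID), and the cleaned `test` data frame that still contains Loan_ID (as in the appendix code):
prob=predict(model3,newdata=te,type="response") # predicted approval probabilities
o=data.frame("Loan_Id"=test$Loan_ID,"Loan_Status"=ifelse(prob>0.5,"YES","NO")) # classify at threshold 0.5
head(o)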
The following table represents Loan_IDs and their corresponding loan approval status.
In this project, we aimed to predict the loan status of applicants. Since the response variable is categorical with two possible outcomes, we used logistic regression. We identified and removed influential points to improve model accuracy. To determine the most relevant predictors, we employed backward selection by AIC, which selected seven covariates: CoapplicantIncome, LoanAmount, Self_Employed, Loan_Amount_Term, Dependents, Credit_History, and Property_Area. The selected model has an AIC of 295.31.
Then we applied the group lasso to the data-set, which also selected seven covariates. However, this method identified “Married” and “Education” as significant predictors instead of “LoanAmount” and “Self_Employed”. After comparing the predictive performance of both models, the first model was found to be more effective.
Finally, we used the first model to predict the Loan Status of applicants on a new data-set.
Here are the R codes used for this project.
library("rmarkdown")
lo=read.csv("C:\\Users\\HP\\Downloads\\loan-train (1).csv")
paged_table(lo)
colSums(is.na(lo))
LoanAmount.mean=mean(lo$LoanAmount,na.rm=TRUE)
lo$LoanAmount=replace(lo$LoanAmount,is.na(lo$LoanAmount)==1,LoanAmount.mean)
Loan_Amount_Term.mean=mean(lo$Loan_Amount_Term,na.rm=TRUE)
lo$Loan_Amount_Term=replace(lo$Loan_Amount_Term,is.na(lo$Loan_Amount_Term)==1,Loan_Amount_Term.mean)
loan=na.omit(lo)#remove NA
print("Empty Cells")
sum(loan=="")
index1=which(loan$Gender=="") # remove empty
index2=which(loan$Self_Employed=="")
index3=which(loan$Married=="")
index4=which(loan$Dependents=="")
loan=loan[-c(index1,index2,index3,index4),]
loan=loan[,-1]#remove 1st column
loan$Gender=as.factor(loan$Gender)#factor
loan$Married=as.factor(loan$Married)
loan$Dependents=as.factor(loan$Dependents)
loan$Education=as.factor(loan$Education)
loan$Self_Employed=as.factor(loan$Self_Employed)
loan$Credit_History =as.factor(loan$Credit_History)
loan$Property_Area =as.factor(loan$Property_Area)
loan$Loan_Status=as.factor(loan$Loan_Status)
str(loan)
library("ggplot2")
library("patchwork")
# gender
data1=data.frame("cat1"=c("Female","Male"),"val1"=c(sum(loan$Gender=="Female"),sum(loan$Gender=="Male")))
slices1=c(sum(loan$Gender=="Female"),sum(loan$Gender=="Male"))
frac1=(slices1/sum(slices1))
ymax1=cumsum(frac1)
ymin1=c(0,head(ymax1,n=-1))
labposi1=(ymax1+ymin1)/2
labls1=paste0(c("Female","Male"),"\n value:",paste(round(frac1*100)),"%",sep="")
y1=ggplot(data1,aes(ymax=ymax1,ymin=ymin1,xmax=4,xmin=3,fill=cat1))+geom_rect()+geom_label(x=3.5,aes(y=labposi1,label=labls1),size=3)+coord_polar(theta="y")+xlim(c(2,4))+theme_void()+theme(legend.position = "none")+labs(title=" Gender")+scale_fill_manual(values=c("yellow","purple"))
# married
data2=data.frame("cat2"=c("Yes","No"),"val2"=c(sum(loan$Married=="Yes"),sum(loan$Married=="No")))
slices2=c(sum(loan$Married=="Yes"),sum(loan$Married=="No"))
frac2=(slices2/sum(slices2))
ymax2=cumsum(frac2)
ymin2=c(0,head(ymax2,n=-1))
labposi2=(ymax2+ymin2)/2
labls2=paste0(c("Yes","No"),"\n value:",paste(round(frac2*100)),"%",sep="")
y2=ggplot(data2,aes(ymax=ymax2,ymin=ymin2,xmax=4,xmin=3,fill=cat2))+geom_rect()+geom_label(x=3.5,aes(y=labposi2,label=labls2),size=3,color="white")+coord_polar(theta="y")+xlim(c(2,4))+theme_void()+theme(legend.position = "none")+labs(title=" Marital Status")+scale_fill_manual(values = c("orange","blue"))
y1+y2
# dependents
data=data.frame("Category"=c("0","1","2","+3"),"values"=c(sum(loan$Dependents=="0"),sum(loan$Dependents=="1"),sum(loan$Dependents=="2"),sum(loan$Dependents=="3+")))
y3=ggplot(data,aes(x=Category,y=values,fill=Category))+geom_bar(stat="identity")+geom_text(aes(label=values),vjust=1.6,color="white")+scale_fill_manual(values=c("green","#993300","#666666","#9933FF"))+scale_x_discrete(limits =c("0","1","2","+3"))+labs(title=" Dependents",x="",y="Count")
#graduate
data3=data.frame("cat3"=c("Graduate","Not Graduate"),"val3"=c(sum(loan$Education=="Graduate"),sum(loan$Education=="Not Graduate")))
slices3=c(sum(loan$Education=="Graduate"),sum(loan$Education=="Not Graduate"))
frac3=(slices3/sum(slices3))
ymax3=cumsum(frac3)
ymin3=c(0,head(ymax3,n=-1))
labposi3=(ymax3+ymin3)/2
labls3=paste0(c("Graduate","Not Graduate"),"\n value:",paste(round(frac3*100)),"%",sep="")
y4=ggplot(data3,aes(ymax=ymax3,ymin=ymin3,xmax=4,xmin=3,fill=cat3))+geom_rect()+geom_label(x=3.5,aes(y=labposi3,label=labls3),size=3)+coord_polar(theta="y")+xlim(c(2,4))+theme_void()+theme(legend.position = "none")+labs(title=" Graduation Status")+scale_fill_manual(values=c("#99CC99","#CC0000"))
y3+y4
p1=ggplot(data=loan,aes(x=ApplicantIncome))+geom_histogram(aes(y=..density..),bins=24,col="black",fill="#00FF99")+labs(title="Applicant Income")+geom_density()
p2=ggplot(data=loan,aes(x=CoapplicantIncome))+geom_histogram(aes(y=..density..),bins=24,col="black",fill="#66CCFF")+labs(title="Coapplicant Income")+geom_density()
p1+p2
p3=ggplot(data=loan,aes(x=LoanAmount))+geom_histogram(aes(y=..density..),bins=24,col="black",fill="#FFFF99")+labs(title="Loan Amount")+geom_density()
p4=ggplot(data=loan,aes(x=Loan_Amount_Term))+geom_histogram(aes(y=..density..),bins=24,col="black",fill="#CCCCCC")+labs(title="Loan Amount In Term")+geom_density()
p3+p4
#self employed
data6=data.frame("Category"=c("Yes","No"),"Values6"=c(sum(loan$ Self_Employed =="Yes"),sum(loan$Self_Employed=="No")))
q1=ggplot(data=data6,aes(x="",y=Values6,fill=Category))+geom_col(color="black")+geom_label(aes(label=paste(round(Values6/sum(Values6)*100),"%",sep="")),position=position_stack(vjust=0.5),show.legend=FALSE)+coord_polar(theta="y")+theme_void()+labs(title="Self Employment Status")+scale_fill_manual(values=c("#FFFF99","#CC0066"))
#credit history
data=data.frame("Category"=c("1","0"),"Values"=c(sum(loan$Credit_History =="1"),sum(loan$Credit_History=="0")))
q2=ggplot(data=data, aes(x=Category,y=Values,fill=Category))+geom_bar(stat="identity")+scale_fill_manual(values=c("#996633","#660066"))+labs(title=" Credit History")+coord_flip()+geom_text(aes(label=Values),hjust=1.6,color="white")
q1+q2
#Property Area
data=data.frame("Category"=c("Rural","Semiurban","Urban"),"Values"=c(sum(loan$Property_Area=="Rural"),sum(loan$Property_Area=="Semiurban"),sum(loan$Property_Area=="Urban")))
q3=ggplot(data=data,aes(x=Category,y=Values,fill=Category))+geom_bar(stat="identity")+scale_fill_manual(values=c("#66CC66","#006699","#FF3333"))+labs(title="Property Area")+geom_text(aes(label=Values),vjust=1.6,color="white")
#loan status
data7=data.frame("Category"=c("Yes","No"),"Values7"=c(sum(loan$ Loan_Status=="Y"),sum(loan$Loan_Status=="N")))
q4=ggplot(data=data7,aes(x="",y=Values7,fill=Category))+geom_col(color="black")+geom_label(aes(label=paste(round(Values7/sum(Values7)*100),"%",sep="")),position=position_stack(vjust=0.5),show.legend=FALSE)+coord_polar(theta="y")+theme_void()+labs(title="Loan Status ")+scale_fill_manual(values=c("#9999FF","#996633"))
q3+q4
summary(loan)
#logistic
model=glm( Loan_Status~Property_Area+Credit_History+Loan_Amount_Term +LoanAmount+CoapplicantIncome+ ApplicantIncome+Self_Employed +Education + Dependents +Married +Gender,family="binomial",data=loan)
model
plot(model,4) # index plot of Cook's distance
influential=(cooks.distance(model)>(4/(511-12-1))) # flag points with Cook's distance above 4/(n-p-1)
sum(influential) # number of influential points
loan=subset(loan,subset=(influential==FALSE)) # drop the influential points
model2=glm( Loan_Status~Property_Area+Credit_History+Loan_Amount_Term +LoanAmount+CoapplicantIncome+ ApplicantIncome+Self_Employed +Education + Dependents +Married +Gender,family="binomial",data=loan)
summary(model2)
library("ResourceSelection")
model3=step(model2,direction="backward") # backward selection by AIC
plot(residuals(model3,"pearson"),main="Residuals Plot",xlab="Fitted Values",ylab="Pearson Residuals")
library("hnp")
hnp(model3)
library("ROCR")
pre=ifelse(fitted(model3)>0.5,1,0)
status=ifelse(loan$Loan_Status=="N",0,1)
tab=table(Predicted=pre,Actual=status)
tab=t(tab)
tab
sum(diag(tab))/sum(tab)
re=tab[2,2]/(tab[2,2]+tab[2,1])
re
tab[1,2]/(tab[1,1]+tab[1,2])
tab[1,1]/(tab[1,1]+tab[1,2])
tab[2,1]/(tab[2,2]+tab[2,1])
pe=tab[2,2]/(tab[2,1]+tab[2,2])
pe
(2*(re*pe))/(re+pe)
pred=predict(model3,type="response",newdata=loan)
pred1=prediction(pred,loan$Loan_Status)
pref=performance(pred1,"tpr","fpr")
plot(pref,print.cutoffs.at=seq(0,1,0.1),colorize=TRUE)
abline(a=0,b=1)
auc=performance(pred1,"auc")@y.values[[1]]
legend(0.6,0.4,auc,title="AUC",cex=0.9)
library("fastDummies")
library("dplyr")
#The function "select_if" is used to select only the factor variables in the dataframe. The function "dummy_cols" is then used to create new columns for each level of the factors variables, with the levels being represented by binary variables (0 or 1). The option "remove_first_dummy" is set to true so that one level of each factor variable is removed to prevent collinearity issues. The option "remove_selected_columns" is also set to true so that the original factor columns are removed from the dataframe. Finally, the option "select_columns" is set to the names of the factor columns.
d_factor2=select_if(loan,is.factor)
dummies=dummy_cols(d_factor2,remove_first_dummy=TRUE,remove_selected_columns=TRUE,select_columns=colnames(d_factor2))
dummies=dummies[,-1]
paged_table(dummies)
d2 =select_if(loan, is.numeric)
X = cbind(d2, dummies)
X = as.matrix(X)
d_factor2$Loan_Status = ifelse(d_factor2$Loan_Status == "Y", 1, 0)
y = d_factor2[,1]
group = dummies
colnames(group) = sub("_[^_]+$", "", colnames(group))
group = cbind(d2, group)
group = colnames(group)
library("grpreg")
fit=grpreg(X,y,group,penalty="grLasso",family="binomial")
cvfit=cv.grpreg(X,y,group,penalty="grLasso",family="binomial")
summary(cvfit)
plot(cvfit)
cvfit$lambda.min
coef(fit,lambda=cvfit$lambda.min)
model4=glm(Loan_Status~Property_Area+Credit_History+Loan_Amount_Term +CoapplicantIncome+Education+Dependents+Married,family="binomial",data=loan)
pre=ifelse(fitted(model4)>0.5,1,0)
status=ifelse(loan$Loan_Status=="N",0,1)
tab=table(Predicted=pre,Actual=status)
tab=t(tab)
tab
pred=predict(model4,type="response",newdata=loan)
pred1=prediction(pred,loan$Loan_Status)
pref=performance(pred1,"tpr","fpr")
plot(pref,print.cutoffs.at=seq(0,1,0.1),colorize=TRUE)
abline(a=0,b=1)
auc=performance(pred1,"auc")@y.values[[1]]
legend(0.6,0.4,auc,title="AUC",cex=0.9)
test=read.csv("C:\\Users\\HP\\OneDrive\\Documents\\lone_test.csv")
paged_table(test)
colSums(is.na(test))
test=read.csv("C:\\Users\\HP\\OneDrive\\Documents\\lone_test.csv")
LoanAmount.mean=mean(test$LoanAmount,na.rm=TRUE)
test$LoanAmount=replace(test$LoanAmount,is.na(test$LoanAmount)==1,LoanAmount.mean)
Loan_Amount_Term.mean=mean(test$Loan_Amount_Term,na.rm=TRUE)
test$Loan_Amount_Term=replace(test$Loan_Amount_Term,is.na(test$Loan_Amount_Term)==1,Loan_Amount_Term.mean)
test=na.omit(test)
print("Empty Cells")
sum(test=="")
index1=which(test$Gender=="") # remove empty
index2=which(test$Self_Employed=="")
index3=which(test$Married=="")
index4=which(test$Dependents=="")
test=test[-c(index1,index2,index3,index4),]
test$Gender=as.factor(test$Gender)#factor
test$Married=as.factor(test$Married)
test$Dependents=as.factor(test$Dependents)
test$Education=as.factor(test$Education)
test$Self_Employed=as.factor(test$Self_Employed)
test$Credit_History =as.factor(test$Credit_History)
test$Property_Area =as.factor(test$Property_Area)
te=test[,-1]
str(te)
predi=predict(model3,te,type="response") # predicted probabilities for the test data
val=ifelse((predi)>0.5,"YES","NO") # classify at threshold 0.5
o=data.frame("Loan_Id"=test[,1],"Loan_Status"=val)
paged_table(o)