1.INTRODUCTION

In the competitive landscape of home financing, understanding and predicting loan eligibility is crucial for both lenders and borrowers. For a home loan company, implementing a robust loan eligibility prediction system can significantly enhance the efficiency of the loan approval process, improve customer satisfaction, and streamline operations.

In the realm of real estate, securing a home loan represents a significant milestone for many individuals and families. As one of the most substantial financial commitments people undertake, the process of determining eligibility for a home loan is both crucial and complex. Home loan eligibility prediction is a key aspect of modern lending practices, utilizing advanced technologies and data analysis to streamline and enhance the decision-making process for both lenders and borrowers.

What is Loan Eligibility Prediction? Loan eligibility prediction refers to the process of assessing whether a prospective borrower meets the necessary criteria to qualify for a home loan. This assessment is typically based on various factors, including credit history, applicant and co-applicant income, employment status, and others. By predicting eligibility, home loan companies can make more informed decisions, minimize risk, and provide better guidance to potential borrowers.

2.OBJECTIVE

• Identify and analyze which variables are most significant in predicting the loan approval status.

• Detect and examine influential points within the data-set.

• Develop and fit a predictive model to the dataset.

• Evaluate the accuracy of the fitted model.

• Apply the developed model to a new data-set with unknown loan approval statuses to predict loan approval status for these new applications.

3.THE DATA-SET

Dream Housing Finance company deals in all home loans. They have a presence across all urban, semi-urban, and rural areas. The customer first applies for a home loan, after which the company validates the customer's eligibility for the loan.

The company wants to automate the loan eligibility process (in real time) based on the customer details provided while filling out the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company wants to identify the customer segments that are eligible for a loan amount, so that it can specifically target these customers. Here they have provided a partial data-set.

3.1.Description of the Variables in the data-set

•Loan_ID : Unique Loan ID

•Gender : Male / Female

•Married : Applicant married Yes or No (Y/N)

•Dependents : Number of dependents

1. “0” represents individuals with no dependents.

2. “1” represents individuals with one dependent.

3. “2” represents individuals with two dependents.

4. “3+” represents individuals with three or more dependents.

•Education : Applicant education (Graduate / Not Graduate)

•Self_Employed :Self-employed (Y/N)

•ApplicantIncome : Applicant income

•CoapplicantIncome : It refers to the income reported by a co-applicant who is applying alongside the primary applicant. A co-applicant might be a spouse, partner, or another individual who will share responsibility for repaying a loan or fulfilling other financial obligations.

•LoanAmount : Loan amount in thousands

•Loan_Amount_Term : Term of a loan in months. This is the duration over which the borrower agrees to repay the loan, usually expressed in months.

•Credit_History : If the credit history meets the guidelines, then the value is 1; otherwise, the value is 0.

•Property_Area : Urban/ Semi-Urban/ Rural

•Loan_Status : Loan approved (Y/N)

Train Data (the data-set used to train or fit the model)

• The original data-set has 614 rows and 13 columns.

3.2.Train Data

3.3.Data Manipulation

          Loan_ID            Gender           Married        Dependents 
                0                 0                 0                 0 
        Education     Self_Employed   ApplicantIncome CoapplicantIncome 
                0                 0                 0                 0 
       LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
               22                14                50                 0 
      Loan_Status 
                0 

• We have 86 NA(not available) values in the data-set: 22 from LoanAmount, 14 from Loan_Amount_Term, and 50 from Credit_History. We replaced the NA values in LoanAmount with the mean of the available values, and similarly replaced the NA values in Loan_Amount_Term with their mean. For the remaining NA values, we removed the rows containing them.

• There are 56 empty cells, so we removed the rows containing these empty cells.

• After removing rows with NA values and missing values, we remove the first column “Loan_ID”.

• Converting Character Variables to Factors: We have 8 categorical variables (Gender, Married, Dependents, Education, Self_Employed, Credit_History, Property_Area, and Loan_Status) that we converted to factor variables. This operation does not change the dimensions of the data-set.

• Final Data-set Dimensions: After the cleaning steps, we have a data-set of 511 rows and 12 columns.

[1] "Empty Cells"
[1] 56
'data.frame':   511 obs. of  12 variables:
 $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
 $ Married          : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 2 2 ...
 $ Dependents       : Factor w/ 4 levels "0","1","2","3+": 1 2 1 1 1 3 1 4 3 2 ...
 $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
 $ Self_Employed    : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 1 ...
 $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
 $ CoapplicantIncome: num  0 1508 0 2358 0 ...
 $ LoanAmount       : num  146 128 66 120 141 ...
 $ Loan_Amount_Term : num  360 360 360 360 360 360 360 360 360 360 ...
 $ Credit_History   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
 $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
 $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
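A condensed sketch of the cleaning steps described above (the full code used in this project is given in the Appendix; the file path is the one used there):

lo = read.csv("C:\\Users\\HP\\Downloads\\loan-train (1).csv")
# mean imputation for LoanAmount and Loan_Amount_Term
lo$LoanAmount[is.na(lo$LoanAmount)] = mean(lo$LoanAmount, na.rm = TRUE)
lo$Loan_Amount_Term[is.na(lo$Loan_Amount_Term)] = mean(lo$Loan_Amount_Term, na.rm = TRUE)
loan = na.omit(lo)                        # drop rows with remaining NA (Credit_History)
loan = loan[rowSums(loan == "") == 0, ]   # drop rows containing empty cells
loan = loan[, -1]                         # drop Loan_ID
# convert the categorical columns to factors
for (v in c("Gender", "Married", "Dependents", "Education", "Self_Employed",
            "Credit_History", "Property_Area", "Loan_Status")) loan[[v]] = as.factor(loan[[v]])
str(loan)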

4.EXPLORATORY DATA ANALYSIS

4.1.Plots

Interpretation

• The donut chart represents the gender of the applicants, divided into two categories: “Female” and “Male”. The chart indicates that 18% of the applicants are “Female” and 82% are “Male”.

• The donut chart represents the marital status of the applicants, divided into two categories: “Yes” and “No”. The chart indicates that 65% of the applicants are married (Yes) and 35% are not married (No).

Interpretation

• The bar chart represents the distribution of the variable “Dependents” across its categories: 294 individuals have no dependents, 85 have one dependent, 88 have two dependents, and 44 have three or more. The majority of individuals in the data-set have no dependents (294), fewer have one or two dependents (85 and 88, respectively), and the smallest group consists of individuals with three or more dependents (44).

• The donut chart represents the graduation status of the applicants, divided into two categories: “Graduate” and “Not Graduate”. The chart indicates that 78% of the applicants are “Graduate” and 22% are “Not Graduate”.


Interpretation

• This is a density plot overlaid on a histogram, showing the distribution of Applicant Income. The distribution is highly right-skewed, indicating that most applicants have lower incomes, with fewer applicants having very high incomes. The majority of the data lies within the range of 0 to 20,000, as evidenced by the height of the bars and the peak of the density curve. There are a few applicants with much higher incomes, up to around 80,000.

• This is a density plot overlaid on a histogram, showing the distribution of Coapplicant Income. The distribution is heavily right-skewed. This means that most coapplicant incomes are concentrated at the lower end, with a long tail extending to the right. A large number of coapplicants have very low or no income.

Interpretation

• This is a density plot overlaid on a histogram, which shows the distribution of loan amounts. Most loan amounts fall between approximately 100 and 200 (in thousands). The peak of the density curve indicates the most common loan amount range, around 150 to 170.

• This is a density plot overlaid on a histogram, showing the distribution of Loan_Amount_Term. The histogram indicates that the majority of loan terms are clustered around 360 months (30 years). The density curve peaks sharply around the 360 mark. There is a smaller peak around 180 months (15 years).

Interpretation

• The pie chart shows the distribution of self-employment status. The chart is divided into two segments, with 14% representing “Yes” (self-employed) and 86% representing “No” (not self-employed). This suggests that the majority of applicants are not self-employed, while a smaller portion are self-employed.

• The bar chart shows Credit_History: Category 1 (credit history meets the guidelines) has 431 instances, while Category 0 has significantly fewer (80).

Interpretation

• The bar chart represents the distribution of values across the three property areas: Rural, Semiurban, and Urban. The Semiurban category has the highest count (197), the Urban area has 165, and the Rural area has the lowest count (149).

• The pie chart shows the distribution of loan statuses in terms of approval (Yes) and rejection (No). The chart is divided into two segments, with 68% representing “Yes” (loans that were approved) and 32% representing “No” (loans that were not approved). This suggests that more than two-thirds of the loan applications were successful, while about one-third were not.

4.2.Summary of the data-set

    Gender    Married   Dependents        Education   Self_Employed
 Female: 91   No :180   0 :294     Graduate    :401   No :441      
 Male  :420   Yes:331   1 : 85     Not Graduate:110   Yes: 70      
                        2 : 88                                     
                        3+: 44                                     
                                                                   
                                                                   
 ApplicantIncome CoapplicantIncome   LoanAmount    Loan_Amount_Term
 Min.   :  150   Min.   :    0     Min.   :  9.0   Min.   : 36     
 1st Qu.: 2886   1st Qu.:    0     1st Qu.:100.0   1st Qu.:360     
 Median : 3858   Median : 1086     Median :129.0   Median :360     
 Mean   : 5308   Mean   : 1562     Mean   :144.2   Mean   :342     
 3rd Qu.: 5820   3rd Qu.: 2254     3rd Qu.:165.5   3rd Qu.:360     
 Max.   :81000   Max.   :33837     Max.   :600.0   Max.   :480     
 Credit_History   Property_Area Loan_Status
 0: 80          Rural    :149   N:164      
 1:431          Semiurban:197   Y:347      
                Urban    :165              
                                           
                                           
                                           

5.METHODOLOGY

Since our response variable is binary, taking only the values 0 and 1, we use a logistic regression model.

5.1.Logistic Regression

The logistic regression model is

\(\log\frac{π(x)}{1 − π(x)} = α + βx\).

Solving for π, this gives

\(π(x) = \frac{e^{α+βx}}{1 + e^{α+βx}}\).

If we want class prediction, we should predict Y = 1 when π ≥ 0.5 and Y = 0 when π < 0.5. This means guessing 1 whenever α + βx is non-negative, and 0 otherwise.

If we have \(p\) covariates \(x_1\), \(x_2\), …, \(x_p\) and a single response variable y, then we can fit a logistic model

\(π(x) = \frac{e^{β_0+β_1x_1+β_2x_2+\cdots+β_px_p}}{1 + e^{β_0+β_1x_1+β_2x_2+\cdots+β_px_p}}\).

Here π(x) represents the probability π(x) = P(Y = 1 | \(x_1\), \(x_2\), …, \(x_p\)). The variables \(x_1\), \(x_2\), …, \(x_p\), on the other hand, can be either quantitative or qualitative. The parameters \(β_i\), i = 1, 2, …, p, signify the effect of \(x_i\) on the probability π(x). The interpretation of the intercept term \(β_0\) remains exactly the same as in the case of a single covariate.

If any of our covariates is a factor (or categorical variable), then we write the model in the same way as for linear models: by including dummy variables. More specifically, if we have a factor covariate A with k levels \(A_1\), \(A_2\), …, \(A_k\) having a potential effect on the response y, then we use \(k−1\) indicator (dummy) variables \(x_1\), \(x_2\), …, \(x_{k−1}\) where \[x_i =\begin{cases} 1 & \text{if the observation receives the $i^{th}$ level $A_i$}\\ 0 & \text{otherwise} \end{cases}\]

Here the \(k^{th}\) level \(A_k\) is the baseline or reference level, against which we measure the effects of the other levels. We can take any one of the k levels as the baseline and compare the other levels with it. So finally we have (p−1)+(k−1) covariates in our model.
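For illustration, a minimal sketch of how R expands a factor into dummy variables (here Property_Area, whose first level, Rural, is the reference):

# model.matrix() shows the dummy coding that glm() uses internally:
# an intercept plus the columns Property_AreaSemiurban and Property_AreaUrban,
# with Rural coded 0 on both dummies
head(model.matrix(~ Property_Area, data = loan))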


Call:  glm(formula = Loan_Status ~ Property_Area + Credit_History + 
    Loan_Amount_Term + LoanAmount + CoapplicantIncome + ApplicantIncome + 
    Self_Employed + Education + Dependents + Married + Gender, 
    family = "binomial", data = loan)

Coefficients:
           (Intercept)  Property_AreaSemiurban      Property_AreaUrban  
            -2.636e+00               9.355e-01               5.715e-02  
       Credit_History1        Loan_Amount_Term              LoanAmount  
             3.732e+00              -2.280e-04              -2.614e-03  
     CoapplicantIncome         ApplicantIncome        Self_EmployedYes  
            -4.823e-05              -1.295e-06              -5.559e-02  
 EducationNot Graduate             Dependents1             Dependents2  
            -4.493e-01              -2.744e-01               2.539e-01  
          Dependents3+              MarriedYes              GenderMale  
             7.942e-02               5.463e-01               2.504e-01  

Degrees of Freedom: 510 Total (i.e. Null);  496 Residual
Null Deviance:      641.4 
Residual Deviance: 465  AIC: 495

• Our model includes 11 covariates, 7 of which are factor variables. We introduce dummy variables for each of these factor variables. As a result, the total number of covariates in the model increases to 14.

5.2.Detection of Influential Points

An influential point is a data point that unduly influences the outputs of the regression analysis. A point is considered influential if its exclusion causes major changes in the fitted regression function. Depending on the location of the point, it may affect all statistics, including the p-values, R-squared, coefficients, and intercept.

5.2.1.Cook’s Distance

A measure of detecting influential point is Cook’s distance

\(CD_i = \frac{h_{ii}}{1-h_{ii}}\cdot\frac{\left(e_i^{(s)}\right)^2}{p+1}\)

If \(CD_i\) > \(\frac{4}{n-p-1}\), then the \(i^{th}\) data point can be designated as an influential point.

n : total number of observations

p : number of variables

\(h_{ii}\) : \(h_{ii}\) is the leverage, i.e., the \(i^{th}\) diagonal element of the hat matrix \(X(X^TX)^{−1}X^T\) , X is the matrix of covariates.

\(e_i^{(s)}\) = \(\frac{e_i}{s_{(i)}\sqrt{1-h_{ii}}}\), where \(s_{(i)}^2\) = \(\frac{RSS_{(i)}}{n-p-2}\) and \(RSS_{(i)}\) is the residual sum of squares obtained by running the least squares method on the data-set excluding the \(i^{th}\) point.
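A short sketch of this flagging rule in R, assuming model is the full logistic fit from the Appendix:

cd = cooks.distance(model)            # Cook's distance for every observation
threshold = 4 / (511 - 12 - 1)        # 4/(n - p - 1), with n = 511 and p = 12 as in the Appendix
influential = cd > threshold
sum(influential)                      # number of influential points
loan = loan[!influential, ]           # drop them before refitting the model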

Interpretation

The plot shows Cook’s distance values on the y-axis for each observation. Larger Cook’s distance values indicate observations that have a greater influence on the model’s coefficients. In this plot, observations 156, 184, and 582 stand out with noticeably higher Cook’s distance values than the others, suggesting these points are influential.

[1] 41

There are 41 influential points.

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

5.3.Summary of the model


Call:
glm(formula = Loan_Status ~ Property_Area + Credit_History + 
    Loan_Amount_Term + LoanAmount + CoapplicantIncome + ApplicantIncome + 
    Self_Employed + Education + Dependents + Married + Gender, 
    family = "binomial", data = loan)

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -3.517e+01  2.144e+03  -0.016 0.986917    
Property_AreaSemiurban  1.608e+00  4.401e-01   3.655 0.000257 ***
Property_AreaUrban      2.065e-01  3.641e-01   0.567 0.570657    
Credit_History1         3.851e+01  2.144e+03   0.018 0.985673    
Loan_Amount_Term       -6.100e-03  4.309e-03  -1.416 0.156916    
LoanAmount             -7.821e-03  3.435e-03  -2.277 0.022780 *  
CoapplicantIncome       2.639e-04  1.335e-04   1.976 0.048151 *  
ApplicantIncome         6.891e-05  8.427e-05   0.818 0.413564    
Self_EmployedYes        2.336e+00  1.061e+00   2.202 0.027684 *  
EducationNot Graduate  -4.188e-01  3.940e-01  -1.063 0.287811    
Dependents1             4.050e-01  4.946e-01   0.819 0.412870    
Dependents2             6.917e-01  5.152e-01   1.343 0.179367    
Dependents3+            1.764e+01  1.484e+03   0.012 0.990513    
MarriedYes              4.618e-01  3.753e-01   1.230 0.218534    
GenderMale             -6.978e-02  4.294e-01  -0.163 0.870899    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 554.33  on 469  degrees of freedom
Residual deviance: 269.67  on 455  degrees of freedom
AIC: 299.67

Number of Fisher Scoring iterations: 19

Interpretation

At the 5% level of significance, 4 covariates are statistically significant (p-value < 0.05): Property_AreaSemiurban, LoanAmount, CoapplicantIncome, and Self_EmployedYes.

• When the other variables remain fixed, for every one-unit increase in LoanAmount, the log odds of loan approval decrease by 7.821e-03.

• When the other variables remain fixed, for every one-unit increase in CoapplicantIncome, the log odds of loan approval increase by 2.639e-04.

• Compared to having a property in a rural area, having a property in a semiurban area increases the log odds of loan approval by 1.608.

• Compared to a non-self-employed applicant, being self-employed increases the log odds of loan approval by 2.336.

To improve the fit of our model, we use backward elimination based on the AIC.

5.4.Backward elimination based on the Akaike Information Criterion (AIC)

Backward stepwise selection starts with all predictors in the model. At each step, the AIC is computed for the current model and for every model obtained by removing one predictor; the removal that gives the lowest AIC is carried out. The process repeats until no removal lowers the AIC, and the resulting model represents a good balance between fit and simplicity (fewer parameters).
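In R this backward search is carried out by step(); a minimal sketch, assuming model2 is the full logistic fit from Section 5.3:

model3 = step(model2, direction = "backward")  # drops one term at a time, guided by AIC
formula(model3)                                # the covariates retained in the final model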

Start:  AIC=299.67
Loan_Status ~ Property_Area + Credit_History + Loan_Amount_Term + 
    LoanAmount + CoapplicantIncome + ApplicantIncome + Self_Employed + 
    Education + Dependents + Married + Gender
                    Df Deviance    AIC
- Gender             1   269.70 297.70
- ApplicantIncome    1   270.45 298.45
- Education          1   270.77 298.77
- Married            1   271.19 299.19
<none>                   269.67 299.67
- Loan_Amount_Term   1   272.14 300.14
- CoapplicantIncome  1   274.16 302.16
- LoanAmount         1   275.66 303.66
- Dependents         3   280.85 304.85
- Self_Employed      1   278.66 306.66
- Property_Area      2   287.54 313.54
- Credit_History     1   519.50 547.50

Step:  AIC=297.7
Loan_Status ~ Property_Area + Credit_History + Loan_Amount_Term + 
    LoanAmount + CoapplicantIncome + ApplicantIncome + Self_Employed + 
    Education + Dependents + Married
                    Df Deviance    AIC
- ApplicantIncome    1   270.47 296.47
- Education          1   270.83 296.83
- Married            1   271.23 297.23
<none>                   269.70 297.70
- Loan_Amount_Term   1   272.14 298.14
- CoapplicantIncome  1   274.22 300.22
- LoanAmount         1   275.74 301.74
- Dependents         3   280.86 302.86
- Self_Employed      1   278.72 304.72
- Property_Area      2   287.96 311.96
- Credit_History     1   519.86 545.86

Step:  AIC=296.47
Loan_Status ~ Property_Area + Credit_History + Loan_Amount_Term + 
    LoanAmount + CoapplicantIncome + Self_Employed + Education + 
    Dependents + Married
                    Df Deviance    AIC
- Married            1   271.83 295.83
- Education          1   271.91 295.91
<none>                   270.47 296.47
- Loan_Amount_Term   1   273.47 297.47
- CoapplicantIncome  1   274.23 298.23
- LoanAmount         1   276.61 300.61
- Dependents         3   281.78 301.78
- Self_Employed      1   281.28 305.28
- Property_Area      2   288.50 310.50
- Credit_History     1   519.86 543.86

Step:  AIC=295.83
Loan_Status ~ Property_Area + Credit_History + Loan_Amount_Term + 
    LoanAmount + CoapplicantIncome + Self_Employed + Education + 
    Dependents
                    Df Deviance    AIC
- Education          1   273.31 295.31
<none>                   271.83 295.83
- Loan_Amount_Term   1   275.59 297.59
- CoapplicantIncome  1   276.79 298.79
- LoanAmount         1   277.41 299.41
- Self_Employed      1   282.79 304.79
- Dependents         3   287.06 305.06
- Property_Area      2   290.75 310.75
- Credit_History     1   522.55 544.55

Step:  AIC=295.31
Loan_Status ~ Property_Area + Credit_History + Loan_Amount_Term + 
    LoanAmount + CoapplicantIncome + Self_Employed + Dependents
                    Df Deviance    AIC
<none>                   273.31 295.31
- Loan_Amount_Term   1   276.53 296.53
- LoanAmount         1   277.99 297.99
- CoapplicantIncome  1   278.65 298.65
- Self_Employed      1   284.02 304.02
- Dependents         3   288.28 304.28
- Property_Area      2   293.69 311.69
- Credit_History     1   526.74 546.74

Interpretation

We selected a model with 7 important covariates: Loan_Amount_Term, LoanAmount, CoapplicantIncome, Self_Employed, Dependents, Property_Area, and Credit_History. It has an AIC of 295.31, which indicates that this model is relatively efficient.

5.5.Residual vs Fitted Plot

Interpretation

The residuals above 0 are tightly packed, indicating that the model predicts those values reasonably well. However, there is a noticeable spread of residuals below 0, particularly in the range between -1 and -4, suggesting that the model is under-predicting for those specific cases. There are some points that lie far from the rest of the data, particularly those below -2. These points might be considered outliers and could indicate instances where the model performs poorly. The fact that most residuals are centered around 0 and are relatively evenly distributed suggests that the model has captured most of the variance in the data reasonably well.

5.6.Half-Normal Probability (hnp) Plot

A half-normal probability plot of the deviance residuals with a simulated envelope is useful both for examining the adequacy of the linear part of the logistic regression model and for identifying deviance residuals that are outlying. A half-normal probability plot helps to highlight outlying deviance residuals even though the residuals are not normally distributed. In a normal probability plot, the kth ordered residual is plotted against the percentile z [(k − 0.375) / (n + 0.25)] or against √MSE times this percentile. In a half-normal probability plot, the kth ordered absolute residual is plotted against:

\(z\left(\frac{k + n − 1/8}{2n + 1/2}\right)\)

Outliers will appear at the top right of a half-normal probability plot as points separated from the others. However, a half-normal plot of the absolute residuals will not necessarily give a straight line even when the fitted model is in fact correct. To identify outlying deviance residuals, we combine a half-normal probability plot with a simulated envelope. This envelope constitutes a band such that the plotted residuals are all likely to fall within the band if the fitted model is correct.
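A minimal sketch using the hnp package, as in the Appendix; the envelope is simulated from the fitted binomial model:

library(hnp)
# half-normal plot of the residuals of model3 with a simulated envelope;
# points escaping the envelope flag possible outliers or lack of fit
hnp(model3)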

Binomial model 

Interpretation

The plot includes confidence bands around the line (shown as dotted lines). These bands represent the expected range of variation under the fitted model. Most of the residuals fall within these confidence bands, suggesting that the model is capturing the majority of the variability in the data accurately. There are some deviations from the line, particularly at the higher quantiles (above approximately 2.0 on the x-axis), where the residuals tend to rise above the line. This deviation might indicate that the model slightly underestimates the higher values or that there could be some issues with the tails of the distribution that are not being captured well by the model.

5.7.Confusion Matrix, Accuracy, TPR, FPR, TNR, FNR, Precision, F Score

A confusion matrix is a useful tool in classification problems to evaluate the performance of a classification model. It provides a summary of prediction results by comparing the predicted classifications against the actual values. In a binary classification problem, the confusion matrix is typically a 2x2 matrix with the following entries:

• True Positives (TP): the number of instances correctly predicted as positive.

• True Negatives (TN): the number of instances correctly predicted as negative.

• False Positives (FP): the number of instances incorrectly predicted as positive.

• False Negatives (FN): the number of instances incorrectly predicted as negative.

Accuracy – It determines the overall predicted accuracy of the model. It is calculated as

Accuracy = \(\frac{True \, Positives+True \, Negatives}{True\, Positives+True\, Negatives+False\, Positives+False\, Negatives}\)

True Positive Rate (TPR) – It indicates how many positive values,out of all the positive values, have been correctly predicted. The formula to calculate the true positive rate is

TPR = \(\frac{True\,Positives}{True\, Positives+False\, Negatives}\)

It is also known as Sensitivity or Recall.

False Positive Rate (FPR) – It indicates how many negative values, out of all the negative values, have been incorrectly predicted. The formula to calculate the false positive rate is

FPR = \(\frac{False\,Positives}{True\, Negatives+False\, Positives}\)

True Negative Rate (TNR) – It indicates how many negative values,out of all the negative values, have been correctly predicted. The formula to calculate the true negative rate is

TNR = \(\frac{True\, Negatives}{True\, Negatives+False\, Positives}\)

It is also known as Specificity.

False Negative Rate (FNR) – It indicates how many positive values,out of all the positive values, have been incorrectly predicted. The formula to calculate false negative rate is

FNR = \(\frac{False\, Negatives}{True\, Positives+False\, Negatives}\)

Precision: It indicates how many values, out of all the predicted positive values, are actually positive. It is formulated as:

Precision = \(\frac{True\, Positives}{True\, Positives+False\, Positives}\)

F Score: The F score is the harmonic mean of precision and recall. It lies between 0 and 1; the higher the value, the better the model. It is formulated as

F = \(\frac{2 \cdot precision \cdot recall}{precision + recall}\).
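A short sketch computing these measures from a 2x2 table laid out as below (rows = Actual, columns = Predicted), assuming it is stored in tab:

TN = tab[1, 1]; FP = tab[1, 2]; FN = tab[2, 1]; TP = tab[2, 2]
accuracy  = (TP + TN) / (TP + TN + FP + FN)
tpr       = TP / (TP + FN)                       # sensitivity / recall
fpr       = FP / (FP + TN)
tnr       = TN / (TN + FP)                       # specificity
fnr       = FN / (TP + FN)
precision = TP / (TP + FP)
f_score   = 2 * precision * tpr / (precision + tpr)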

Confusion Matrix

      Predicted
Actual   0   1
     0  74  56
     1   0 340

Accuracy-

[1] 0.8808511

The model is accurate, correctly classifying approximately 88.09% of the total instances.

True Positive Rate (TPR)

[1] 1

The model identifies all actual positive cases correctly, with no false negatives. This means that every positive instance is correctly classified as positive.

False Positive Rate (FPR) -

[1] 0.4307692

About 43.08% of the actual negative cases are incorrectly classified as positive.

True Negative Rate (TNR)

[1] 0.5692308

The model correctly identifies around 56.92% of the actual negative cases.

False Negative Rate (FNR)

[1] 0

There are no false negatives, meaning the model does not miss any actual positive cases.

Precision -

[1] 0.8585859

About 85.86% of the instances predicted as positive (approved) are actually positive.

F Score -

[1] 0.923913

An F score of about 0.92 indicates a good balance between precision and recall for the positive class.

5.8.Receiver Operator Characteristic (ROC) Curve

Area Under the Curve (AUC): The AUC quantifies the overall ability of the model to discriminate between positive and negative classes. AUC ranges from 0 to 1, where a value of 0.5 indicates no discriminative power (similar to random guessing), and a value of 1 indicates perfect discrimination.

The ROC curve is plotted with the True Positive Rate on the Y axis and the False Positive Rate on the X axis, computed over a range of classification thresholds. The model's discrimination is summarized by the Area Under the Curve (AUC), also referred to as the index of accuracy (A) or concordance index; the higher the area, the better the model.
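A minimal sketch with the ROCR package (the one used in the Appendix), assuming model3 is the selected model:

library(ROCR)
probs = predict(model3, type = "response")       # fitted probabilities of approval
pred  = prediction(probs, loan$Loan_Status)      # labels: N/Y
perf  = performance(pred, "tpr", "fpr")          # ROC coordinates
plot(perf); abline(a = 0, b = 1)                 # diagonal = random guessing
performance(pred, "auc")@y.values[[1]]           # area under the curve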

Interpretation

The curve is close to the upper left corner, which indicates a good model, as it maximizes the TPR while minimizing the FPR. The AUC value is 0.90108, which is a high score. This suggests that the model has a very good ability to distinguish between the positive and negative classes. A perfect model would have an AUC of 1. The ROC curve and the AUC value indicate that the model has good predictive performance.

5.9.Group Lasso

At first we create a feature matrix where the categorical features are converted to numeric with one-hot encoding (it transforms categorical data into a binary matrix representation where each category is represented by a unique binary column). We drop the first dummy of each factor (the reference level) to avoid multicollinearity.

Then we combine the remaining dummies with the numeric variables and convert the result to a matrix X. The outcome variable is put into a vector called y.

Finally, we create a group vector that distinguishes groups of predictors in the data. It separates the predictors into different groups based on some criterion, such as the type of predictor or the relationship between predictors. This step is necessary for the group lasso, as it allows different penalties to be applied to different groups of predictors.
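A condensed sketch of this construction with fastDummies, assuming the response is Loan_Status (variable names follow the Appendix):

library(fastDummies)
library(dplyr)
facs    = select_if(loan, is.factor)
dummies = dummy_cols(facs, remove_first_dummy = TRUE, remove_selected_columns = TRUE)
dummies = dummies[, colnames(dummies) != "Loan_Status_Y"]    # keep the response out of the predictors
X     = as.matrix(cbind(select_if(loan, is.numeric), dummies))
y     = ifelse(loan$Loan_Status == "Y", 1, 0)
group = sub("_[^_]+$", "", colnames(dummies))                # dummies from one factor share a group label
group = c(colnames(select_if(loan, is.numeric)), group)      # numeric predictors form their own groups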



The acronym “LASSO” stands for Least Absolute Shrinkage and Selection Operator. This particular type of regression is well-suited to models showing high levels of multicollinearity. It performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model.

Group lasso is a variable selection method in regression models that extends the lasso to select variables in groups. In situations where features are naturally grouped (e.g., polynomial features, categorical variables converted into dummy variables, or genetic data grouped by genes), it might be desirable to either select or discard an entire group of variables together. Mathematical formulation: the model for the group lasso is

\(\underset{\beta}{\min}\left(\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p}X_{ij}\beta_j\right)^2 + \lambda \sum_{g=1}^{G}\sqrt{p_g}\,\|\beta_g\|_2\right)\)

(For a binary response, as here, the squared-error term is replaced by the negative log-likelihood of the binomial (logistic) model, while the group penalty stays the same.)

where:

• n is the number of observations.

• \(y_i\) is the response variable for the i-th observation.

• \(X_{ij}\) is the value of the j-th predictor for the i-th observation.

• \(β_j\) is the coefficient for the j-th predictor.

• λ is the regularization parameter that controls the strength of the penalty.

• \(β_g\) is the vector of coefficients for the g-th group of predictors.

• \(∥β_g∥_2\) represents the Euclidean (L2) norm of the coefficients in the g-th group.

• \(p_g\) is the number of predictors in the g-th group.


Logistic regression modeling Pr(y=Male)

5.9.2.Cross Validation

Cross validation is a technique used in machine learning to evaluate the performance of a model on unseen data. It involves dividing the available data into multiple folds or subsets, using one of these folds as a validation set, and training the model on the remaining folds. This process is repeated multiple times, each time using a different fold as the validation set. Finally, the results from each validation step are averaged to produce a more robust estimate of the model’s performance.
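A minimal sketch of the group-lasso fit and its cross-validation with grpreg, using the X, y, and group objects built above:

library(grpreg)
fit   = grpreg(X, y, group, penalty = "grLasso", family = "binomial")
cvfit = cv.grpreg(X, y, group, penalty = "grLasso", family = "binomial")  # 10-fold CV by default
plot(cvfit)                               # cross-validation error as a function of lambda
cvfit$lambda.min                          # lambda with the smallest CV error
coef(fit, lambda = cvfit$lambda.min)      # coefficients at that lambda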

Logistic regression modeling Pr(y=Male)
grLasso-penalized logistic regression with n=470, p=14
At minimum cross-validation error (lambda=0.0112):
-------------------------------------------------
  Nonzero coefficients: 10
  Nonzero groups: 7
  Cross-validation error of 0.78
  Maximum R-squared: 0.14
  Maximum signal-to-noise ratio: 0.19
  Prediction error at lambda.min: 0.183

Minimum lambda value (the value of lambda that gives the lowest mean cross-validated error):

[1] 0.01115884
            (Intercept)         ApplicantIncome       CoapplicantIncome 
           1.2882341569            0.0000000000            0.0002796523 
             LoanAmount        Loan_Amount_Term             Married_Yes 
           0.0000000000           -0.0021408138            1.4390024087 
           Dependents_1            Dependents_2           Dependents_3+ 
          -0.1435232569            0.7378090264            0.5760988879 
 Education_Not Graduate       Self_Employed_Yes        Credit_History_1 
           0.0455341745            0.0000000000            0.0093371183 
Property_Area_Semiurban     Property_Area_Urban           Loan_Status_Y 
          -0.3770140832           -0.0613556516            0.0000000000 

We can see that there are 7 effective covariates: CoapplicantIncome, Loan_Amount_Term, Married, Dependents, Education, Credit_History, and Property_Area.

We fit a logistic regression model using these 7 covariates.

Confusion Matrix

      Predicted
Actual   0   1
     0  73  57
     1   0 340

Receiver Operator Characteristic (ROC) Curve and Area Under the Curve

The model we fit using the AIC backward method included 7 covariates: CoapplicantIncome, LoanAmount, Self_Employed, Loan_Amount_Term, Dependents, Credit_History, and Property_Area. We checked the model's prediction performance and found that the AUC value was 0.9010. Using the group lasso method, we also obtained a model with 7 covariates, but this model included ‘Married’ and ‘Education’ instead of ‘LoanAmount’ and ‘Self_Employed’. We found the AUC for this model to be 0.8898. Although the second model also has a good AUC value, we decided to use the first model for predicting loan status.

6.For the dataset provided below, we want to predict the loan approval status

The data-set has 367 rows and 12 columns.

6.1.Data Manipulation

          Loan_ID            Gender           Married        Dependents 
                0                 0                 0                 0 
        Education     Self_Employed   ApplicantIncome CoapplicantIncome 
                0                 0                 0                 0 
       LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
                5                 6                29                 0 

• We have 40 NA (not available) values in the data-set: 29 from Credit_History, 5 from LoanAmount, and 6 from Loan_Amount_Term. As with the training data, we replaced the NA values in LoanAmount and Loan_Amount_Term with the means of the available values, and removed the rows with NA values in Credit_History.

• There are 41 empty cells, so we removed the rows containing these empty cells.

• Converting Character Variables to Factors: We have 7 categorical variables (Gender, Married, Dependents, Education, Self_Employed, Credit_History, and Property_Area) that we converted to factor variables. This operation does not change the dimensions of the data-set.

• Final Data-set Dimensions: After the cleaning steps and dropping Loan_ID, we have a data-set of 298 rows and 11 columns.

[1] "Empty Cells"
[1] 41
'data.frame':   298 obs. of  11 variables:
 $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 2 2 2 1 ...
 $ Married          : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 1 1 1 ...
 $ Dependents       : Factor w/ 4 levels "0","1","2","3+": 1 2 3 1 1 2 3 1 1 1 ...
 $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 2 2 2 2 2 1 ...
 $ Self_Employed    : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
 $ ApplicantIncome  : int  5720 3076 5000 3276 2165 2226 3881 2400 3091 4666 ...
 $ CoapplicantIncome: int  0 1500 1800 0 3422 0 0 2400 0 0 ...
 $ LoanAmount       : num  110 126 208 78 152 59 147 123 90 124 ...
 $ Loan_Amount_Term : num  360 360 360 360 360 360 360 360 360 360 ...
 $ Credit_History   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
 $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 3 2 1 2 3 2 ...

The following table represents Loan_IDs and their corresponding loan approval status.
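A short sketch of how these predictions are produced, assuming te is the cleaned test data (without Loan_ID) and model3 is the selected model:

prob = predict(model3, newdata = te, type = "response")   # predicted probability of approval
status = ifelse(prob > 0.5, "Y", "N")
head(data.frame(Loan_ID = test$Loan_ID, Loan_Status = status))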

7.CONCLUSION

In this project, we aimed to predict the loan status of applicants. Since the response variable is categorical with two possible outcomes, we used logistic regression. We identified and removed influential points to improve model accuracy. To determine the most relevant predictors, we employed the AIC backward method, which selected seven covariates: CoapplicantIncome, LoanAmount, Self_Employed, Loan_Amount_Term, Dependents, Credit_History, and Property_Area. The model AIC is 295.31.

Then we applied the group lasso to the data-set, which also selected seven covariates. However, this method identified “Married” and “Education” as significant predictors instead of “LoanAmount” and “Self_Employed”. After comparing the predictive performance of both models, the first model was found to be more effective.

Finally, we used the first model to predict the Loan Status of applicants on a new data-set.

8. APPENDIX

Here are the R codes used for this project.

library("rmarkdown")
lo=read.csv("C:\\Users\\HP\\Downloads\\loan-train (1).csv")
paged_table(lo)

colSums(is.na(lo))

LoanAmount.mean=mean(lo$LoanAmount,na.rm=TRUE)
lo$LoanAmount=replace(lo$LoanAmount,is.na(lo$LoanAmount)==1,LoanAmount.mean)
Loan_Amount_Term.mean=mean(lo$Loan_Amount_Term,na.rm=TRUE)
lo$Loan_Amount_Term=replace(lo$Loan_Amount_Term,is.na(lo$Loan_Amount_Term)==1,Loan_Amount_Term.mean)
loan=na.omit(lo)#remove NA
print("Empty Cells")
sum(loan=="")
index1=which(loan$Gender=="") # remove empty
index2=which(loan$Self_Employed=="")
index3=which(loan$Married=="")
index4=which(loan$Dependents=="")
loan=loan[-c(index1,index2,index3,index4),]
loan=loan[,-1]#remove 1st column
loan$Gender=as.factor(loan$Gender)#factor
loan$Married=as.factor(loan$Married)
loan$Dependents=as.factor(loan$Dependents)
loan$Education=as.factor(loan$Education)
loan$Self_Employed=as.factor(loan$Self_Employed)
loan$Credit_History =as.factor(loan$Credit_History)
loan$Property_Area =as.factor(loan$Property_Area)
loan$Loan_Status=as.factor(loan$Loan_Status)
str(loan)

library("ggplot2")
library("patchwork")
# gender
data1=data.frame("cat1"=c("Female","Male"),"val1"=c(sum(loan$Gender=="Female"),sum(loan$Gender=="Male")))
slices1=c(sum(loan$Gender=="Female"),sum(loan$Gender=="Male"))
frac1=(slices1/sum(slices1))
ymax1=cumsum(frac1)
ymin1=c(0,head(ymax1,n=-1))
labposi1=(ymax1+ymin1)/2
labls1=paste0(c("Female","Male"),"\n value:",paste(round(frac1*100)),"%",sep="")
y1=ggplot(data1,aes(ymax=ymax1,ymin=ymin1,xmax=4,xmin=3,fill=cat1))+geom_rect()+geom_label(x=3.5,aes(y=labposi1,label=labls1),size=3)+coord_polar(theta="y")+xlim(c(2,4))+theme_void()+theme(legend.position = "none")+labs(title=" Gender")+scale_fill_manual(values=c("yellow","purple"))
# married
data2=data.frame("cat2"=c("Yes","No"),"val2"=c(sum(loan$Married=="Yes"),sum(loan$Married=="No")))
slices2=c(sum(loan$Married=="Yes"),sum(loan$Married=="No"))
frac2=(slices2/sum(slices2))
ymax2=cumsum(frac2)
ymin2=c(0,head(ymax2,n=-1))
labposi2=(ymax2+ymin2)/2
labls2=paste0(c("Yes","No"),"\n value:",paste(round(frac2*100)),"%",sep="")
y2=ggplot(data2,aes(ymax=ymax2,ymin=ymin2,xmax=4,xmin=3,fill=cat2))+geom_rect()+geom_label(x=3.5,aes(y=labposi2,label=labls2),size=3,color="white")+coord_polar(theta="y")+xlim(c(2,4))+theme_void()+theme(legend.position = "none")+labs(title=" Marital Status")+scale_fill_manual(values = c("orange","blue"))

y1+y2

# dependents
data=data.frame("Category"=c("0","1","2","+3"),"values"=c(sum(loan$Dependents=="0"),sum(loan$Dependents=="1"),sum(loan$Dependents=="2"),sum(loan$Dependents=="3+")))
y3=ggplot(data,aes(x=Category,y=values,fill=Category))+geom_bar(stat="identity")+geom_text(aes(label=values),vjust=1.6,color="white")+scale_fill_manual(values=c("green","#993300","#666666","#9933FF"))+scale_x_discrete(limits =c("0","1","2","+3"))+labs(title=" Dependents",x="",y="Count")

#graduate
data3=data.frame("cat3"=c("Graduate","Not Graduate"),"val3"=c(sum(loan$Education=="Graduate"),sum(loan$Education=="Not Graduate")))
slices3=c(sum(loan$Education=="Graduate"),sum(loan$Education=="Not Graduate"))
frac3=(slices3/sum(slices3))
ymax3=cumsum(frac3)
ymin3=c(0,head(ymax3,n=-1))
labposi3=(ymax3+ymin3)/2
labls3=paste0(c("Graduate","Not Graduate"),"\n value:",paste(round(frac3*100)),"%",sep="")
y4=ggplot(data3,aes(ymax=ymax3,ymin=ymin3,xmax=4,xmin=3,fill=cat3))+geom_rect()+geom_label(x=3.5,aes(y=labposi3,label=labls3),size=3)+coord_polar(theta="y")+xlim(c(2,4))+theme_void()+theme(legend.position = "none")+labs(title=" Graduation Status")+scale_fill_manual(values=c("#99CC99","#CC0000"))
y3+y4

p1=ggplot(data=loan,aes(x=ApplicantIncome))+geom_histogram(aes(y=after_stat(density)),bins=24,col="black",fill="#00FF99")+labs(title="Applicant Income")+geom_density()
p2=ggplot(data=loan,aes(x=CoapplicantIncome))+geom_histogram(aes(y=after_stat(density)),bins=24,col="black",fill="#66CCFF")+labs(title="Coapplicant Income")+geom_density()


p1+p2

p3=ggplot(data=loan,aes(x=LoanAmount))+geom_histogram(aes(y=after_stat(density)),bins=24,col="black",fill="#FFFF99")+labs(title="Loan Amount")+geom_density()
p4=ggplot(data=loan,aes(x=Loan_Amount_Term))+geom_histogram(aes(y=after_stat(density)),bins=24,col="black",fill="#CCCCCC")+labs(title="Loan Amount In Term")+geom_density()
p3+p4

#self employed
data6=data.frame("Category"=c("Yes","No"),"Values6"=c(sum(loan$ Self_Employed =="Yes"),sum(loan$Self_Employed=="No")))
q1=ggplot(data=data6,aes(x="",y=Values6,fill=Category))+geom_col(color="black")+geom_label(aes(label=paste(round(Values6/sum(Values6)*100),"%",sep="")),position=position_stack(vjust=0.5),show.legend=FALSE)+coord_polar(theta="y")+theme_void()+labs(title="Self Employment Status")+scale_fill_manual(values=c("#FFFF99","#CC0066"))

#credit history
data=data.frame("Category"=c("1","0"),"Values"=c(sum(loan$Credit_History =="1"),sum(loan$Credit_History=="0")))
q2=ggplot(data=data, aes(x=Category,y=Values,fill=Category))+geom_bar(stat="identity")+scale_fill_manual(values=c("#996633","#660066"))+labs(title=" Credit History")+coord_flip()+geom_text(aes(label=Values),hjust=1.6,color="white")


q1+q2

#Property Area
data=data.frame("Category"=c("Rural","Semiurban","Urban"),"Values"=c(sum(loan$Property_Area=="Rural"),sum(loan$Property_Area=="Semiurban"),sum(loan$Property_Area=="Urban")))
q3=ggplot(data=data,aes(x=Category,y=Values,fill=Category))+geom_bar(stat="identity")+scale_fill_manual(values=c("#66CC66","#006699","#FF3333"))+labs(title="Property Area")+geom_text(aes(label=Values),vjust=1.6,color="white")
#loan status
data7=data.frame("Category"=c("Yes","No"),"Values7"=c(sum(loan$ Loan_Status=="Y"),sum(loan$Loan_Status=="N")))
q4=ggplot(data=data7,aes(x="",y=Values7,fill=Category))+geom_col(color="black")+geom_label(aes(label=paste(round(Values7/sum(Values7)*100),"%",sep="")),position=position_stack(vjust=0.5),show.legend=FALSE)+coord_polar(theta="y")+theme_void()+labs(title="Loan Status ")+scale_fill_manual(values=c("#9999FF","#996633"))
q3+q4

summary(loan)

#logistic
model=glm( Loan_Status~Property_Area+Credit_History+Loan_Amount_Term +LoanAmount+CoapplicantIncome+ ApplicantIncome+Self_Employed +Education + Dependents +Married +Gender,family="binomial",data=loan)
model

plot(model,4)

influential=(cooks.distance(model)>(4/(511-12-1)))
sum(influential)
loan=subset(loan,subset=(influential==FALSE))

model2=glm( Loan_Status~Property_Area+Credit_History+Loan_Amount_Term +LoanAmount+CoapplicantIncome+ ApplicantIncome+Self_Employed +Education + Dependents +Married +Gender,family="binomial",data=loan)

summary(model2)

library("ResourceSelection")
model3=step(model2,direction="backward")

plot(fitted(model3),residuals(model3,"pearson"),main="Residuals vs Fitted Plot",xlab="Fitted Values",ylab="Pearson Residuals")# Pearson residuals against the fitted values

library("hnp")
hnp(model3)

library("ROCR")
pre=ifelse(fitted(model3)>0.5,1,0)
status=ifelse(loan$Loan_Status=="N",0,1)
tab=table(Predicted=pre,Actual=status)
tab=t(tab)
tab

sum(diag(tab))/sum(tab)

re=tab[2,2]/(tab[2,2]+tab[2,1])
re

tab[1,2]/(tab[1,1]+tab[1,2])

tab[1,1]/(tab[1,1]+tab[1,2])

tab[2,1]/(tab[2,2]+tab[2,1])

pe=tab[2,2]/(tab[1,2]+tab[2,2])# precision = TP/(TP+FP)
pe

(2*(re*pe))/(re+pe)

pred=predict(model3,type="response",newdata=loan)
pred1=prediction(pred,loan$Loan_Status)
pref=performance(pred1,"tpr","fpr")
plot(pref,print.cutoffs.at=seq(0,1,0.1),colorize=TRUE) 
abline(a=0,b=1)
auc=performance(pred1,"auc")@y.values[[1]]
legend(0.6,0.4,auc,title="AUC",cex=0.9)

library("fastDummies")
library("dplyr")
#The function "select_if" is used to select only the factor variables in the dataframe. The function "dummy_cols" is then used to create new columns for each level of the factors variables, with the levels being represented by binary variables (0 or 1). The option "remove_first_dummy" is set to true so that one level of each factor variable is removed to prevent collinearity issues. The option "remove_selected_columns" is also set to true so that the original factor columns are removed from the dataframe. Finally, the option "select_columns" is set to the names of the factor columns.
d_factor2=select_if(loan,is.factor)
dummies=dummy_cols(d_factor2,remove_first_dummy=TRUE,remove_selected_columns=TRUE,select_columns=colnames(d_factor2))
dummies=dummies[,colnames(dummies)!="Loan_Status_Y"]# drop the response dummy so it is not used as a predictor
paged_table(dummies)

d2 =select_if(loan, is.numeric)
X = cbind(d2, dummies)
X = as.matrix(X)
y = ifelse(d_factor2$Loan_Status == "Y", 1, 0)# response: Loan_Status coded 1 = approved
group = dummies
colnames(group) = sub("_[^_]+$", "", colnames(group))
group = cbind(d2, group)
group = colnames(group)

library("grpreg")
fit=grpreg(X,y,group,penalty="grLasso",family="binomial")

cvfit=cv.grpreg(X,y,group,penalty="grLasso",family="binomial")
summary(cvfit)
plot(cvfit)

cvfit$lambda.min

coef(fit,lambda=cvfit$lambda.min)

model4=glm(Loan_Status~Property_Area+Credit_History+Loan_Amount_Term +CoapplicantIncome+Education+Dependents+Married,family="binomial",data=loan)

pre=ifelse(fitted(model4)>0.5,1,0)
status=ifelse(loan$Loan_Status=="N",0,1)
tab=table(Predicted=pre,Actual=status)
tab=t(tab)
tab

pred=predict(model4,type="response",newdata=loan)
pred1=prediction(pred,loan$Loan_Status)
pref=performance(pred1,"tpr","fpr")
plot(pref,print.cutoffs.at=seq(0,1,0.1),colorize=TRUE) 
abline(a=0,b=1)
auc=performance(pred1,"auc")@y.values[[1]]
legend(0.6,0.4,auc,title="AUC",cex=0.9)

test=read.csv("C:\\Users\\HP\\OneDrive\\Documents\\lone_test.csv")
paged_table(test)

colSums(is.na(test))

test=read.csv("C:\\Users\\HP\\OneDrive\\Documents\\lone_test.csv")
LoanAmount.mean=mean(test$LoanAmount,na.rm=TRUE)
test$LoanAmount=replace(test$LoanAmount,is.na(test$LoanAmount)==1,LoanAmount.mean)
Loan_Amount_Term.mean=mean(test$Loan_Amount_Term,na.rm=TRUE)
test$Loan_Amount_Term=replace(test$Loan_Amount_Term,is.na(test$Loan_Amount_Term)==1,Loan_Amount_Term.mean)
test=na.omit(test)
print("Empty Cells")
sum(test=="")
index1=which(test$Gender=="") # remove empty
index2=which(test$Self_Employed=="")
index3=which(test$Married=="")
index4=which(test$Dependents=="")
test=test[-c(index1,index2,index3,index4),]

test$Gender=as.factor(test$Gender)#factor
test$Married=as.factor(test$Married)
test$Dependents=as.factor(test$Dependents)
test$Education=as.factor(test$Education)
test$Self_Employed=as.factor(test$Self_Employed)
test$Credit_History =as.factor(test$Credit_History)
test$Property_Area =as.factor(test$Property_Area)
te=test[,-1]
str(te)
predi=predict(model3,newdata=te,type="response")# predicted probability of loan approval
val=ifelse(predi>0.5,"YES","NO")

o=data.frame("Loan_Id"=test[,1],"Loan_Status"=val)
paged_table(o)