LOAD DATA INTO ENVIRONMENT

The na.strings argument applies to the body of the file: any field that matches one of the listed strings is read as NA.

cr <- read.csv("CreditRiskData.csv", header = TRUE, na.strings = c("", "NA"), stringsAsFactors = TRUE)
summary(cr)
##      Loan_ID       Gender    Married    Dependents        Education  
##  LP001002:  1   Female:112   No  :213   0   :345   Graduate    :480  
##  LP001003:  1   Male  :489   Yes :398   1   :102   Not Graduate:134  
##  LP001005:  1   NA's  : 13   NA's:  3   2   :101                     
##  LP001006:  1                           3+  : 51                     
##  LP001008:  1                           NA's: 15                     
##  LP001011:  1                                                        
##  (Other) :608                                                        
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##  No  :500      Min.   :  150   Min.   :    0     Min.   :  9.0  
##  Yes : 82      1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  NA's: 32      Median : 3812   Median : 1188     Median :128.0  
##                Mean   : 5403   Mean   : 1621     Mean   :146.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                  NA's   :22     
##  Loan_Amount_Term Credit_History     Property_Area Loan_Status
##  Min.   : 12      Min.   :0.0000   Rural    :179   N:192      
##  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422      
##  Median :360      Median :1.0000   Urban    :202              
##  Mean   :342      Mean   :0.8422                              
##  3rd Qu.:360      3rd Qu.:1.0000                              
##  Max.   :480      Max.   :1.0000                              
##  NA's   :14       NA's   :50
class(cr$Credit_History)
## [1] "integer"
str(cr)
## 'data.frame':    614 obs. of  13 variables:
##  $ Loan_ID          : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Married          : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 2 2 ...
##  $ Dependents       : Factor w/ 4 levels "0","1","2","3+": 1 2 1 1 1 3 1 4 3 2 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed    : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 1 ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : int  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
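
To see the effect of na.strings in isolation, here is a minimal sketch on a small inline CSV. The data and the name `toy` are made up for illustration and are not taken from CreditRiskData.csv; `stringsAsFactors = TRUE` is set explicitly because R 4.0 and later no longer convert strings to factors by default.

```r
# Toy CSV text (hypothetical data, not taken from CreditRiskData.csv)
csv_text <- "id,gender,amount
1,Male,100
2,,NA
3,Female,250"

# Empty fields and the literal string "NA" are both read as NA
toy <- read.csv(text = csv_text, header = TRUE,
                na.strings = c("", "NA"), stringsAsFactors = TRUE)

colSums(is.na(toy))   # number of NA values per column
levels(toy$gender)    # the empty string does not become a factor level
```

Without `""` in na.strings, the empty gender field would typically be kept as an empty-string level of the factor rather than as NA.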

How to handle missing values?

  1. Check whether any column has more than 50% missing values. If so, remove that column. A column can also be removed when it is not essential for the analysis.

colSums(is.na(cr))
##           Loan_ID            Gender           Married        Dependents 
##                 0                13                 3                15 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                32                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                22                14                50                 0 
##       Loan_Status 
##                 0
cr$Loan_ID<-NULL
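
colSums(is.na(cr)) reports counts, so the 50% rule still has to be checked against the 614 rows (here the largest count, 50 for Credit_History, is roughly 8%). A minimal sketch of applying the rule directly with proportions follows; the data frame `toy` and its values are hypothetical, and the 0.5 cut-off is just the threshold stated in step 1.

```r
# Toy data frame (hypothetical values) with one mostly-empty column
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c("x", "y", "z", "w")
)

na_share  <- colMeans(is.na(toy))              # fraction of NA values per column
drop_cols <- names(na_share)[na_share > 0.5]   # columns above the 50% threshold

toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_clean)
```

The same two lines (`colMeans(is.na(...))` and the threshold filter) could be run on cr in place of the manual count check.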
  2. Check the data type (class) of each column and convert where appropriate. Credit_History, for example, is a 0/1 flag that was read as an integer, so we convert it to a factor.
str(cr)
## 'data.frame':    614 obs. of  12 variables:
##  $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Married          : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 2 2 ...
##  $ Dependents       : Factor w/ 4 levels "0","1","2","3+": 1 2 1 1 1 3 1 4 3 2 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed    : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 1 ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : int  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
class(cr$Credit_History)
## [1] "integer"
cr$Credit_History<-as.factor(cr$Credit_History)
  3. If a variable is numeric and normally distributed, impute with the mean.

Let’s check whether LoanAmount is normally distributed. If it is, we impute missing values with mean(cr$LoanAmount, na.rm = TRUE); without na.rm = TRUE the mean itself would be NA.

shapiro.test(cr$LoanAmount)
## 
##  Shapiro-Wilk normality test
## 
## data:  cr$LoanAmount
## W = 0.76752, p-value < 2.2e-16
  4. If a variable is numeric and not normally distributed, impute with the median.
    Let’s check which values are NA within cr$LoanAmount.
which(is.na(cr$LoanAmount))
##  [1]   1  36  64  82  96 103 104 114 128 203 285 306 323 339 388 436 438
## [18] 480 525 551 552 606

The p-value is far below 0.05, so the Shapiro-Wilk test rejects normality: LoanAmount is not normally distributed. We therefore replace NA with the median.

cr$LoanAmount[is.na(cr$LoanAmount)] <- median(cr$LoanAmount, na.rm = TRUE)

Let’s re-check if there are any NA within cr$LoanAmount.

which(is.na(cr$LoanAmount))
## integer(0)

We can also spot-check some of the positions that were previously NA (1, 36, 64, 82, ...) to confirm that they now hold the median value.

cr$LoanAmount[c(1,36,64,82)]
## [1] 128 128 128 128
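
Steps 3 and 4 can be combined into one helper. The following is only a sketch: `impute_numeric` and its `alpha` parameter are hypothetical names introduced here, and the 0.05 level is an assumed convention, not something fixed by the analysis above. Note that shapiro.test accepts missing values but needs between 3 and 5000 non-missing observations.

```r
# Hypothetical helper: impute NA with the mean if Shapiro-Wilk does not
# reject normality at level alpha, otherwise with the median.
impute_numeric <- function(x, alpha = 0.05) {
  p <- shapiro.test(x)$p.value   # NA values in x are allowed by shapiro.test
  fill <- if (p > alpha) mean(x, na.rm = TRUE) else median(x, na.rm = TRUE)
  x[is.na(x)] <- fill
  x
}

# Strongly right-skewed toy data (hypothetical), so the median is used
skewed <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 200, NA, NA)
filled <- impute_numeric(skewed)
filled[12:13]   # the two positions that were NA
```

Applied to the data set, the call would look like cr$LoanAmount <- impute_numeric(cr$LoanAmount). A formal test is only one way to decide; a histogram or Q-Q plot is a common complement.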
  5. If a variable is categorical, impute with the mode.
    Let’s analyse cr$Credit_History: count the NA values, count the complete values, and locate the NA positions.
sum(is.na(cr$Credit_History))
## [1] 50
sum(!is.na(cr$Credit_History))
## [1] 564
which(is.na(cr$Credit_History))
##  [1]  17  25  31  43  80  84  87  96 118 126 130 131 157 182 188 199 220
## [18] 237 238 260 261 280 310 314 318 319 324 349 364 378 393 396 412 445
## [35] 450 452 461 474 491 492 498 504 507 531 534 545 557 566 584 601

Let us find out the mode value of cr$Credit_History

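# Return the most frequent non-missing value of x (first one wins on ties)
ModeVal <- function(x) {
  ux <- unique(x[!is.na(x)])             # candidate values, NA excluded
  ux[which.max(tabulate(match(x, ux)))]  # value with the highest count
}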
ModeVal(cr$Credit_History)
## [1] 1
## Levels: 0 1

Let’s replace NA with mode value of cr$Credit_History

cr$Credit_History[is.na(cr$Credit_History)]<-ModeVal(cr$Credit_History)

Recheck if there are any NA

which(is.na(cr$Credit_History))
## integer(0)

We repeat the same steps for cr$Self_Employed, cr$Dependents, cr$Gender and cr$Married. For each variable, summary() shows the most frequent level, which we then use to replace NA.

summary(cr$Self_Employed)
##   No  Yes NA's 
##  500   82   32
cr$Self_Employed[is.na(cr$Self_Employed)]<-'No'

summary(cr$Dependents)
##    0    1    2   3+ NA's 
##  345  102  101   51   15
cr$Dependents[is.na(cr$Dependents)]<-'0'

summary(cr$Gender)
## Female   Male   NA's 
##    112    489     13
cr$Gender[is.na(cr$Gender)]<-'Male'

summary(cr$Married)
##   No  Yes NA's 
##  213  398    3
cr$Married[is.na(cr$Married)]<-'Yes'
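
The four assignments above hard-code the mode read from each summary. As an alternative, here is a sketch that imputes every factor column with its own mode in a loop. The data frame `toy` is hypothetical, and `mode_val` is a stand-in written for this example that follows the same idea as ModeVal while ignoring NA.

```r
# Most frequent non-missing value (same idea as ModeVal)
mode_val <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy data frame (hypothetical values)
toy <- data.frame(
  Gender  = factor(c("Male", "Male", NA, "Female")),
  Married = factor(c("Yes", NA, "Yes", "No")),
  Income  = c(10, 20, NA, 40)
)

# Impute each factor column with its own mode; numeric columns are untouched
is_fac <- vapply(toy, is.factor, logical(1))
for (col in names(toy)[is_fac]) {
  toy[[col]][is.na(toy[[col]])] <- mode_val(toy[[col]])
}

colSums(is.na(toy))
```

The loop avoids typing level names by hand, which also rules out assigning a value that is not an existing level of the factor.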

Let’s finally check whether any NA values remain in the data frame.

which(is.na(cr[,]))
##  [1] 4932 4949 4957 4958 4986 5025 5078 5110 5136 5145 5248 5280 5334 5336
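
which() on a data frame returns linear, column-major positions, which are hard to map back to a column by eye. A minimal sketch of getting row and column numbers instead, using the arr.ind argument of which(); the data frame `toy` is hypothetical.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(a = c(1, 2, 3), b = c(NA, 5, NA), c = c("x", NA, "z"))

which(is.na(toy))                         # linear, column-major positions
pos <- which(is.na(toy), arr.ind = TRUE)  # matrix with "row" and "col" columns
unique(names(toy)[pos[, "col"]])          # names of columns that contain NA
```

On cr, the same call would report which column the remaining positions belong to.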

So cr$Loan_Amount_Term still has NA values.

  6. Check the levels in cr$Loan_Amount_Term. If there are fewer than 10 levels, keep it as a factor; otherwise treat it as an integer. Here it is kept as an integer, so we impute with the median.

summary(cr$Loan_Amount_Term)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      12     360     360     342     360     480      14
cr$Loan_Amount_Term[is.na(cr$Loan_Amount_Term)] <- median(cr$Loan_Amount_Term, na.rm = TRUE)

Let’s perform a final check if there are any missing values in data.

colSums(is.na(cr))
##            Gender           Married        Dependents         Education 
##                 0                 0                 0                 0 
##     Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount 
##                 0                 0                 0                 0 
##  Loan_Amount_Term    Credit_History     Property_Area       Loan_Status 
##                 0                 0                 0                 0

NONE, GREAT !! WE CAN PROCEED WITH MODELLING FOR CREDIT RISK