Handle Missing Values In Data

LOAD DATA INTO ENVIRONMENT

The na.strings argument of read.csv lists the strings in the body of the file that should be read as NA. Here both empty fields and the literal string NA are treated as missing.

cr <- read.csv("CreditRiskData.csv", header = TRUE, na.strings = c("", "NA"),
               stringsAsFactors = TRUE)  # factors, as in the output below (needed on R >= 4.0)
summary(cr)

##      Loan_ID       Gender    Married    Dependents        Education  
##  LP001002:  1   Female:112   No  :213   0   :345   Graduate    :480  
##  LP001003:  1   Male  :489   Yes :398   1   :102   Not Graduate:134  
##  LP001005:  1   NA's  : 13   NA's:  3   2   :101                     
##  LP001006:  1                           3+  : 51                     
##  LP001008:  1                           NA's: 15                     
##  LP001011:  1                                                        
##  (Other) :608                                                        
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##  No  :500      Min.   :  150   Min.   :    0     Min.   :  9.0  
##  Yes : 82      1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  NA's: 32      Median : 3812   Median : 1188     Median :128.0  
##                Mean   : 5403   Mean   : 1621     Mean   :146.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                  NA's   :22     
##  Loan_Amount_Term Credit_History     Property_Area Loan_Status
##  Min.   : 12      Min.   :0.0000   Rural    :179   N:192      
##  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422      
##  Median :360      Median :1.0000   Urban    :202              
##  Mean   :342      Mean   :0.8422                              
##  3rd Qu.:360      3rd Qu.:1.0000                              
##  Max.   :480      Max.   :1.0000                              
##  NA's   :14       NA's   :50
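
To see what na.strings does in isolation, here is a minimal sketch on a made-up three-row CSV passed inline through the text argument. The column names and values are hypothetical and are not taken from CreditRiskData.csv.

```r
# Hypothetical three-row CSV, inlined so the example needs no file.
csv_text <- "id,city,amount
1,Pune,100
2,,NA
3,NA,250"

# Default na.strings = "NA": the empty character field stays an empty string.
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With "" listed as well, the empty field is also read as NA.
strict_read <- read.csv(text = csv_text, na.strings = c("", "NA"),
                        stringsAsFactors = FALSE)

sum(is.na(default_read$city))
sum(is.na(strict_read$city))
```

The second count is higher because the blank city in row 2 is only treated as missing when "" is part of na.strings.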

class(cr$Credit_History)

## [1] "integer"

str(cr)

## 'data.frame':    614 obs. of  13 variables:
##  $ Loan_ID          : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Married          : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 2 2 ...
##  $ Dependents       : Factor w/ 4 levels "0","1","2","3+": 1 2 1 1 1 3 1 4 3 2 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed    : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 1 ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : int  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...

How to handle missing values?

1. Check whether a column has more than 50% missing values. If it does, remove that column. A column can also be removed when it is not essential for the analysis.

colSums(is.na(cr))

##           Loan_ID            Gender           Married        Dependents 
##                 0                13                 3                15 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                32                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                22                14                50                 0 
##       Loan_Status 
##                 0
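
The 50% rule is easier to apply to percentages than to raw counts. The following is a minimal sketch on a hypothetical toy data frame (the names toy, mostly_missing and drop_cols are made up for illustration), not the author's own code.

```r
# Hypothetical toy data frame; `mostly_missing` is NA in 3 of 4 rows.
toy <- data.frame(
  income         = c(5849, NA, 3000, 2583),
  mostly_missing = c(NA, NA, 1, NA),
  status         = c("Y", "N", "Y", "N")
)

# colMeans() of the logical NA matrix gives the fraction missing per column.
pct_missing <- colMeans(is.na(toy)) * 100
pct_missing

# Columns above the 50% threshold are candidates for removal.
drop_cols <- names(pct_missing)[pct_missing > 50]
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_clean)
```

On the credit risk data the same expression, colMeans(is.na(cr)) * 100, would show that no column comes close to the threshold, since the largest count above is 50 out of 614 rows.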

cr$Loan_ID<-NULL

2. Check the data type (class) of each column and convert where appropriate. Credit_History is stored as an integer but only takes the values 0 and 1, so it is converted to a factor.

str(cr)

## 'data.frame':    614 obs. of  12 variables:
##  $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Married          : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 2 2 ...
##  $ Dependents       : Factor w/ 4 levels "0","1","2","3+": 1 2 1 1 1 3 1 4 3 2 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed    : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 1 ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : int  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...

class(cr$Credit_History)

## [1] "integer"

cr$Credit_History<-as.factor(cr$Credit_History)
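
As a small illustration of what this conversion does, the sketch below uses a hypothetical 0/1 vector named history. It shows that NA survives the conversion and that converting back should go through character.

```r
# Hypothetical 0/1 indicator stored as integer, with one missing value.
history <- c(1L, 0L, NA, 1L, 1L)

history_f <- as.factor(history)
levels(history_f)     # the distinct non-missing values become the levels
summary(history_f)    # counts per level, plus a count of NA's

# Converting back: go through character, otherwise you get the level codes.
as.integer(as.character(history_f))
```

After the conversion, summary() reports counts per level rather than a mean, which is the more meaningful view for a flag such as Credit_History.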

3. If a variable is numeric and normally distributed, impute with the mean.

Let’s check whether LoanAmount is normally distributed. If it is, missing values are imputed with mean(cr$LoanAmount, na.rm = TRUE).

shapiro.test(cr$LoanAmount)

## 
##  Shapiro-Wilk normality test
## 
## data:  cr$LoanAmount
## W = 0.76752, p-value < 2.2e-16

4. If a variable is numeric and not normally distributed, impute with the median.
Let’s check which values are NA within cr$LoanAmount.

which(is.na(cr$LoanAmount))

##  [1]   1  36  64  82  96 103 104 114 128 203 285 306 323 339 388 436 438
## [18] 480 525 551 552 606

The Shapiro-Wilk test rejects normality (p-value < 2.2e-16), so LoanAmount is not normally distributed and we use the median to replace NA.

cr$LoanAmount[is.na(cr$LoanAmount)] <- median(cr$LoanAmount, na.rm = TRUE)

Let’s re-check if there are any NA within cr$LoanAmount.

which(is.na(cr$LoanAmount))

## integer(0)

We can also check that the positions that were previously NA (1, 36, 64, 82, ...) now hold the median value.

cr$LoanAmount[c(1,36,64,82)]

## [1] 128 128 128 128
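
The mean-or-median decision from steps 3 and 4 can be sketched as one helper. This is only an illustration: impute_numeric is a hypothetical function name, and the 0.05 cut-off is an assumed convention rather than something stated in this post.

```r
# Hypothetical helper: mean if Shapiro-Wilk does not reject normality, else median.
# alpha = 0.05 is an assumed cut-off.
impute_numeric <- function(x, alpha = 0.05) {
  observed <- x[!is.na(x)]
  p_value <- shapiro.test(observed)$p.value   # needs 3 to 5000 observed values
  fill <- if (p_value > alpha) mean(observed) else median(observed)
  x[is.na(x)] <- fill
  x
}

# Deterministic toy vectors built from normal quantiles, each with one NA.
symmetric <- c(qnorm(ppoints(50)), NA)              # close to normal
skewed    <- c(exp(1.5 * qnorm(ppoints(50))), NA)   # log-normal shape, right-skewed

tail(impute_numeric(symmetric), 1)  # filled with the mean
tail(impute_numeric(skewed), 1)     # filled with the median
```

For a right-skewed variable such as LoanAmount the median is less affected by a few very large values, which is why it is the safer fill value here.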

5. If a variable is categorical, impute with the mode.
Let’s analyse cr$Credit_History: count the NA values, count the complete values, and locate the NA positions.

sum(is.na(cr$Credit_History))

## [1] 50

sum(!is.na(cr$Credit_History))

## [1] 564

which(is.na(cr$Credit_History))

##  [1]  17  25  31  43  80  84  87  96 118 126 130 131 157 182 188 199 220
## [18] 237 238 260 261 280 310 314 318 319 324 349 364 378 393 396 412 445
## [35] 450 452 461 474 491 492 498 504 507 531 534 545 557 566 584 601

Let us find out the mode value of cr$Credit_History

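# Most frequent non-missing value of x (ties go to the value seen first).
ModeVal <- function(x) {
  ux <- unique(x[!is.na(x)])             # candidate values, NA excluded
  ux[which.max(tabulate(match(x, ux)))]  # count each candidate, pick the largest
}
ModeVal(cr$Credit_History)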

## [1] 1
## Levels: 0 1

Let’s replace NA with mode value of cr$Credit_History

cr$Credit_History[is.na(cr$Credit_History)]<-ModeVal(cr$Credit_History)

Recheck if there are any NA

which(is.na(cr$Credit_History))

## integer(0)

We repeat the same steps for the variables cr$Self_Employed, cr$Dependents, cr$Gender and cr$Married. In each case summary() shows the most frequent level, which is then used to replace NA.

summary(cr$Self_Employed)

##   No  Yes NA's 
##  500   82   32

cr$Self_Employed[is.na(cr$Self_Employed)]<-'No'

summary(cr$Dependents)

##    0    1    2   3+ NA's 
##  345  102  101   51   15

cr$Dependents[is.na(cr$Dependents)]<-'0'

summary(cr$Gender)

## Female   Male   NA's 
##    112    489     13

cr$Gender[is.na(cr$Gender)]<-'Male'

summary(cr$Married)

##   No  Yes NA's 
##  213  398    3

cr$Married[is.na(cr$Married)]<-'Yes'
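
The repeated replacements above can also be written as a loop over all factor columns. The following is a minimal sketch on a hypothetical toy data frame; mode_value is an illustrative helper defined here so the block is self-contained.

```r
# Mode helper that ignores NA when counting.
mode_value <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical toy data frame with two factor columns and one numeric column.
toy <- data.frame(
  gender  = factor(c("Male", "Male", NA, "Female")),
  married = factor(c("Yes", NA, "Yes", "No")),
  income  = c(10, 20, NA, 40)
)

# Impute only the factor columns; numeric columns are left untouched.
factor_cols <- names(toy)[sapply(toy, is.factor)]
for (col in factor_cols) {
  missing <- is.na(toy[[col]])
  toy[[col]][missing] <- mode_value(toy[[col]])
}

colSums(is.na(toy))
```

Writing the level by hand, as done above, keeps each decision visible. The loop is mainly useful when there are many categorical columns.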

Let’s finally check if there are any NA values left in the data frame.

which(is.na(cr[,]))

##  [1] 4932 4949 4957 4958 4986 5025 5078 5110 5136 5145 5248 5280 5334 5336
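
The numbers above are linear indices that count down the columns one after another. As a hedged sketch, the arithmetic below maps the first index back to a row and column for a 614-row data frame, and then shows which(..., arr.ind = TRUE) on a hypothetical toy data frame, which returns row and column positions directly.

```r
# Linear index -> (row, column) for a data frame with 614 rows,
# using the first index from the output above.
n_rows <- 614
idx <- 4932
col_pos <- ceiling(idx / n_rows)       # column position
row_pos <- (idx - 1) %% n_rows + 1     # row position
c(row = row_pos, col = col_pos)

# Same idea on a hypothetical toy data frame: arr.ind = TRUE returns row/col pairs.
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), c = c(NA, 2, 3))
na_pos <- which(is.na(toy), arr.ind = TRUE)
na_pos
names(toy)[unique(na_pos[, "col"])]    # columns that still contain NA
```

Index 4932 works out to row 20 of column 9, and with Loan_ID removed the ninth column of cr is Loan_Amount_Term.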

So cr$Loan_Amount_Term still has NA values (the indices above all fall in that column).
6. Check the levels in cr$Loan_Amount_Term. If a variable has fewer than 10 levels, keep it as a factor; otherwise convert it to integer.

summary(cr$Loan_Amount_Term)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      12     360     360     342     360     480      14

cr$Loan_Amount_Term[is.na(cr$Loan_Amount_Term)] <- median(cr$Loan_Amount_Term, na.rm = TRUE)
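
The rule of thumb in step 6 can be sketched as follows. The vector term and its values are hypothetical, and the threshold of 10 is simply the one quoted in the text. This illustrates the decision and is not the exact code used on cr.

```r
# Hypothetical loan terms in months, with one missing value.
term <- c(360L, 360L, 180L, NA, 480L, 360L, 120L)

n_distinct <- length(unique(term[!is.na(term)]))
n_distinct

# Rule of thumb from the text: fewer than 10 distinct values -> factor.
if (n_distinct < 10) {
  term_f <- factor(term)
  # Mode imputation for the factor version (table() skips NA by default).
  top_level <- names(which.max(table(term_f)))
  term_f[is.na(term_f)] <- top_level
  term_out <- term_f
} else {
  # Median imputation for the integer version.
  term[is.na(term)] <- as.integer(median(term, na.rm = TRUE))
  term_out <- term
}
term_out
```

Either branch fills the gap with the most typical term. The choice only affects whether later modelling treats the term as a category or as a number.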

Let’s perform a final check if there are any missing values in data.

colSums(is.na(cr))

##            Gender           Married        Dependents         Education 
##                 0                 0                 0                 0 
##     Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount 
##                 0                 0                 0                 0 
##  Loan_Amount_Term    Credit_History     Property_Area       Loan_Status 
##                 0                 0                 0                 0
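
For a quick yes/no answer instead of a per-column table, base R also offers anyNA() and complete.cases(). The sketch below uses a hypothetical data frame named cleaned. The same calls should work on cr.

```r
# Hypothetical cleaned data frame.
cleaned <- data.frame(amount = c(128, 66, 120),
                      status = factor(c("Y", "N", "Y")))

anyNA(cleaned)                    # single TRUE/FALSE for the whole data frame
sum(!complete.cases(cleaned))     # number of rows that still contain an NA

# Fail fast before modelling if anything is still missing.
stopifnot(!anyNA(cleaned))
```

Putting the stopifnot() call at the end of a cleaning script makes the script stop with an error if a later change to the data reintroduces missing values.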

NONE, GREAT! WE CAN PROCEED WITH MODELLING FOR CREDIT RISK.

Anand Jage

19/06/2019