LOAD DATA INTO ENVIRONMENT
The na.strings argument is for substitution within the body of the file, that is, matching strings that should be replaced with NA.
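# stringsAsFactors = TRUE is needed on R >= 4.0 to get the factor columns shown below
cr <- read.csv("CreditRiskData.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = c("", "", "NA"))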
summary(cr)
##      Loan_ID       Gender    Married    Dependents        Education
##  LP001002:  1   Female:112   No  :213   0   :345   Graduate    :480
##  LP001003:  1   Male  :489   Yes :398   1   :102   Not Graduate:134
##  LP001005:  1   NA's  : 13   NA's:  3   2   :101
##  LP001006:  1                           3+  : 51
##  LP001008:  1                           NA's: 15
##  LP001011:  1
##  (Other) :608
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount
##  No  :500      Min.   :  150   Min.   :    0     Min.   :  9.0
##  Yes : 82      1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0
##  NA's: 32      Median : 3812   Median : 1188     Median :128.0
##                Mean   : 5403   Mean   : 1621     Mean   :146.4
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0
##                Max.   :81000   Max.   :41667     Max.   :700.0
##                                                  NA's   :22
##  Loan_Amount_Term Credit_History     Property_Area Loan_Status
##  Min.   : 12      Min.   :0.0000   Rural    :179   N:192
##  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422
##  Median :360      Median :1.0000   Urban    :202
##  Mean   :342      Mean   :0.8422
##  3rd Qu.:360      3rd Qu.:1.0000
##  Max.   :480      Max.   :1.0000
##  NA's   :14       NA's   :50
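To see the effect of na.strings in isolation, here is a minimal sketch on a small inline CSV. The data and the extra "n/a" marker are made up for illustration and are not taken from CreditRiskData.csv.

```r
# Toy CSV text (hypothetical data, not the credit-risk file)
csv_text <- "id,gender,amount
1,Male,100
2,,NA
3,Female,
4,n/a,250"

# Default na.strings = "NA": the empty gender field stays an empty string
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Treat empty fields, "NA" and "n/a" as missing
custom_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                        na.strings = c("", "NA", "n/a"))

# Compare the number of NA per column under each setting
colSums(is.na(default_read))
colSums(is.na(custom_read))
```

With the default setting the text column reports no NA even though two of its entries carry no information, which is why the markers are listed explicitly when the file is read.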
class(cr$Credit_History)
## [1] "integer"
str(cr)
## 'data.frame': 614 obs. of 13 variables:
## $ Loan_ID : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Married : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 2 2 ...
## $ Dependents : Factor w/ 4 levels "0","1","2","3+": 1 2 1 1 1 3 1 4 3 2 ...
## $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
## $ Self_Employed : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 1 ...
## $ ApplicantIncome : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
## $ CoapplicantIncome: num 0 1508 0 2358 0 ...
## $ LoanAmount : int NA 128 66 120 141 267 95 158 168 349 ...
## $ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
## $ Credit_History : int 1 1 1 1 1 1 1 0 1 1 ...
## $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
## $ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
How to handle missing values?
1. Check if a column has more than 50% missing values. If it does, remove that column. A column can also be removed when it is not essential for our analysis.
colSums(is.na(cr))
## Loan_ID Gender Married Dependents
## 0 13 3 15
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0 32 0 0
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 22 14 50 0
## Loan_Status
## 0
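The counts above are all far below half of the 614 rows (the largest is 50 for Credit_History, roughly 8%), so no column is dropped for missingness; Loan_ID is presumably removed because an identifier is not essential for the analysis. The 50% rule itself can be sketched as follows, on a toy data frame whose names and values are assumptions made for illustration.

```r
# Toy data frame (hypothetical), with one mostly-empty column
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, "x"),
  c = c("p", "q", "r", "s")
)

# colMeans() on the logical matrix gives the share of NA per column
na_share <- colMeans(is.na(df))

# Keep only columns at or below the 50% threshold
threshold <- 0.5
df_kept <- df[, na_share <= threshold, drop = FALSE]
names(df_kept)
```

Using drop = FALSE keeps the result a data frame even if only one column survives.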
cr$Loan_ID<-NULL
str(cr)
## 'data.frame': 614 obs. of 12 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Married : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 2 2 ...
## $ Dependents : Factor w/ 4 levels "0","1","2","3+": 1 2 1 1 1 3 1 4 3 2 ...
## $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
## $ Self_Employed : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 1 ...
## $ ApplicantIncome : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
## $ CoapplicantIncome: num 0 1508 0 2358 0 ...
## $ LoanAmount : int NA 128 66 120 141 267 95 158 168 349 ...
## $ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
## $ Credit_History : int 1 1 1 1 1 1 1 0 1 1 ...
## $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
## $ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
class(cr$Credit_History)
## [1] "integer"
cr$Credit_History<-as.factor(cr$Credit_History)
Let’s check whether LoanAmount is normally distributed. If it is, impute missing values with mean(cr$LoanAmount, na.rm = TRUE).
shapiro.test(cr$LoanAmount)
##
## Shapiro-Wilk normality test
##
## data: cr$LoanAmount
## W = 0.76752, p-value < 2.2e-16
Locate the NA values within cr$LoanAmount.
which(is.na(cr$LoanAmount))
## [1] 1 36 64 82 96 103 104 114 128 203 285 306 323 339 388 436 438
## [18] 480 525 551 552 606
LoanAmount is not normally distributed (the Shapiro-Wilk p-value is below 2.2e-16, so normality is rejected), so we replace NA with the median.
cr$LoanAmount[is.na(cr$LoanAmount)]<-median(cr$LoanAmount,na.rm = T)
Let’s re-check whether there are any NA within cr$LoanAmount.
which(is.na(cr$LoanAmount))
## integer(0)
We can also spot-check that positions which previously held NA (1, 36, 64, 82, ...) now contain the median value.
cr$LoanAmount[c(1,36,64,82)]
## [1] 128 128 128 128
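The decision used for LoanAmount (mean if the data look normal, median otherwise) can be wrapped in a small helper. This is a sketch only: impute_numeric, the 0.05 cut-off, and the toy data are assumptions and do not come from the original analysis.

```r
# Hypothetical helper: impute with the mean if Shapiro-Wilk does not reject
# normality at level alpha, otherwise with the median
impute_numeric <- function(x, alpha = 0.05) {
  p <- shapiro.test(x)$p.value   # shapiro.test() allows missing values
  fill <- if (p > alpha) mean(x, na.rm = TRUE) else median(x, na.rm = TRUE)
  x[is.na(x)] <- fill
  x
}

set.seed(1)
skewed <- c(rexp(50), NA, NA)    # toy right-skewed data with two NA
filled <- impute_numeric(skewed)
sum(is.na(filled))
```

Note that shapiro.test() requires between 3 and 5000 non-missing values, so the helper would need a different rule for larger columns.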
Next, cr$Credit_History. Let us count the NA values, count the complete values, and locate the NA positions.
sum(is.na(cr$Credit_History))
## [1] 50
sum(!is.na(cr$Credit_History))
## [1] 564
which(is.na(cr$Credit_History))
## [1] 17 25 31 43 80 84 87 96 118 126 130 131 157 182 188 199 220
## [18] 237 238 260 261 280 310 314 318 319 324 349 364 378 393 396 412 445
## [35] 450 452 461 474 491 492 498 504 507 531 534 545 557 566 584 601
Let us find out the mode value of cr$Credit_History
ModeVal <- function(x) {
  # Most frequent non-missing value; ties go to the value seen first
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux), nbins = length(ux)))]
}
ModeVal(cr$Credit_History)
## [1] 1
## Levels: 0 1
Let’s replace NA with the mode value of cr$Credit_History.
cr$Credit_History[is.na(cr$Credit_History)]<-ModeVal(cr$Credit_History)
Recheck if there are any NA
which(is.na(cr$Credit_History))
## integer(0)
We repeat the same steps for the variables cr$Self_Employed, cr$Dependents, cr$Gender and cr$Married, replacing NA in each with its most frequent level.
summary(cr$Self_Employed)
## No Yes NA's
## 500 82 32
cr$Self_Employed[is.na(cr$Self_Employed)]<-'No'
summary(cr$Dependents)
## 0 1 2 3+ NA's
## 345 102 101 51 15
cr$Dependents[is.na(cr$Dependents)]<-'0'
summary(cr$Gender)
## Female Male NA's
## 112 489 13
cr$Gender[is.na(cr$Gender)]<-'Male'
summary(cr$Married)
## No Yes NA's
## 213 398 3
cr$Married[is.na(cr$Married)]<-'Yes'
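Instead of reading the most frequent level off each summary by hand, the four replacements above could be done in a loop. The following is a sketch on a toy data frame; mode_value and the data are assumptions introduced for illustration.

```r
# Most frequent non-missing value (hypothetical helper, same idea as ModeVal)
mode_value <- function(x) {
  counts <- table(x)               # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Toy data frame (hypothetical)
df <- data.frame(
  Gender  = factor(c("Male", "Male", NA, "Female")),
  Married = factor(c("Yes", NA, "Yes", "No"))
)

# Fill NA in each listed column with that column's most frequent level
for (col in c("Gender", "Married")) {
  missing <- is.na(df[[col]])
  df[[col]][missing] <- mode_value(df[[col]])
}

df
```

Because the fill value is always an existing level of the factor, the assignment does not produce the "invalid factor level" warning.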
Let’s finally check whether any NA remain in the data frame.
which(is.na(cr[,]))
## [1] 4932 4949 4957 4958 4986 5025 5078 5110 5136 5145 5248 5280 5334 5336
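So cr$Loan_Amount_Term still has NA values: the positions above are column-major linear indices, and with 614 rows every index from 4913 to 5526 falls in the ninth column, Loan_Amount_Term.
6. Check the levels in cr$Loan_Amount_Term. If there are fewer than 10 levels, keep it as a factor; otherwise convert it to integer.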
summary(cr$Loan_Amount_Term)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12 360 360 342 360 480 14
cr$Loan_Amount_Term[is.na(cr$Loan_Amount_Term)]<- median(cr$Loan_Amount_Term, na.rm = T)
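The linear indices printed by which(is.na(cr[,])) are hard to read directly. A sketch of a more readable alternative, using the arr.ind argument of which() on a toy data frame (the data are assumptions made for illustration):

```r
# Toy data frame (hypothetical): 3 rows, NA only in column "term"
df <- data.frame(
  income = c(10, 20, 30),
  term   = c(360, NA, NA)
)

# Linear indices count down the columns, so rows 2 and 3 of column 2
# are positions 5 and 6
which(is.na(df))

# arr.ind = TRUE returns row/column positions instead
pos <- which(is.na(df), arr.ind = TRUE)

# Names of the columns that still contain NA
unique(names(df)[pos[, "col"]])
```

On cr the same call would point at Loan_Amount_Term without any index arithmetic.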
Let’s perform a final check if there are any missing values in data.
colSums(is.na(cr))
## Gender Married Dependents Education
## 0 0 0 0
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## 0 0 0 0
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## 0 0 0 0
NONE, GREAT! WE CAN PROCEED WITH MODELLING FOR CREDIT RISK.
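If the cleaning steps are rerun as a script, the final visual check can be turned into an explicit guard that stops before modelling when something is still missing. This is a sketch only; check_complete is a hypothetical helper, not part of the original workflow.

```r
# Hypothetical guard: stop with an error if any value is still missing
check_complete <- function(df) {
  na_counts <- colSums(is.na(df))
  if (any(na_counts > 0)) {
    stop("Missing values remain in: ",
         paste(names(na_counts)[na_counts > 0], collapse = ", "))
  }
  invisible(TRUE)
}

# Toy data frame (hypothetical) with no missing values
clean <- data.frame(x = 1:3, y = c("a", "b", "c"))
check_complete(clean)
```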