This dataset is used to predict credit card defaulters with a decision tree. The independent variables describe customer details, and from them the dependent variable pnm (payment next month: 1 means the customer defaults, 0 means they do not) is predicted.
library(readxl)
default_of_credit_card_clients <- read_excel("C:/Users/DELL/Desktop/Imarticus/Assignments excel/default of credit card clients.xls")
View(default_of_credit_card_clients)
def_1 is a copy of our original dataset. Working on the copy avoids accidental changes or deletions in the original.
The head function shows the first 6 rows of our dataset, which gives a basic understanding of the data.
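The copy and preview steps described above might look like the sketch below. The small data frame is only a stand-in so the snippet runs on its own (its values are the first few shown in the str output); with the real data, the object loaded by read_excel would be used instead.

```r
# Stand-in for the loaded data (a few values copied from the str() output, not the full dataset)
default_of_credit_card_clients <- data.frame(
  ID        = 1:8,
  LIMIT_BAL = c(20000, 120000, 90000, 50000, 50000, 50000, 500000, 100000),
  pnm       = c(1, 1, 0, 0, 0, 0, 0, 0)
)

# Work on a copy so the original object stays untouched
def_1 <- default_of_credit_card_clients

# head() returns the first 6 rows by default
head(def_1)
```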
The str function shows the structure of the dataset, including which columns are character and which are numeric.
str(def_1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 30000 obs. of 25 variables:
$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ LIMIT_BAL: num 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
$ SEX : num 2 2 2 2 1 1 1 2 2 1 ...
$ EDUCATION: num 2 2 2 2 2 1 1 2 3 3 ...
$ MARRIAGE : num 1 2 2 1 1 2 2 2 1 2 ...
$ AGE : num 24 26 34 37 57 37 29 23 28 35 ...
$ PAY_0 : num 2 -1 0 0 -1 0 0 0 0 -2 ...
$ PAY_2 : num 2 2 0 0 0 0 0 -1 0 -2 ...
$ PAY_3 : num -1 0 0 0 -1 0 0 -1 2 -2 ...
$ PAY_4 : num -1 0 0 0 0 0 0 0 0 -2 ...
$ PAY_5 : num -2 0 0 0 0 0 0 0 0 -1 ...
$ PAY_6 : num -2 2 0 0 0 0 0 -1 0 -1 ...
$ BILL_AMT1: num 3913 2682 29239 46990 8617 ...
$ BILL_AMT2: num 3102 1725 14027 48233 5670 ...
$ BILL_AMT3: num 689 2682 13559 49291 35835 ...
$ BILL_AMT4: num 0 3272 14331 28314 20940 ...
$ BILL_AMT5: num 0 3455 14948 28959 19146 ...
$ BILL_AMT6: num 0 3261 15549 29547 19131 ...
$ PAY_AMT1 : num 0 0 1518 2000 2000 ...
$ PAY_AMT2 : num 689 1000 1500 2019 36681 ...
$ PAY_AMT3 : num 0 1000 1000 1200 10000 657 38000 0 432 0 ...
$ PAY_AMT4 : num 0 1000 1000 1100 9000 ...
$ PAY_AMT5 : num 0 0 1000 1069 689 ...
$ PAY_AMT6 : num 0 2000 5000 1000 679 ...
$ pnm : num 1 1 0 0 0 0 0 0 0 0 ...
The summary of the dataset gives the minimum, maximum, quartiles, mean and median of each variable.
summary(def_1)
ID LIMIT_BAL SEX EDUCATION
Min. : 1 Min. : 10000 Min. :1.000 Min. :0.000
1st Qu.: 7501 1st Qu.: 50000 1st Qu.:1.000 1st Qu.:1.000
Median :15000 Median : 140000 Median :2.000 Median :2.000
Mean :15000 Mean : 167484 Mean :1.604 Mean :1.853
3rd Qu.:22500 3rd Qu.: 240000 3rd Qu.:2.000 3rd Qu.:2.000
Max. :30000 Max. :1000000 Max. :2.000 Max. :6.000
MARRIAGE AGE PAY_0 PAY_2
Min. :0.000 Min. :21.00 Min. :-2.0000 Min. :-2.0000
1st Qu.:1.000 1st Qu.:28.00 1st Qu.:-1.0000 1st Qu.:-1.0000
Median :2.000 Median :34.00 Median : 0.0000 Median : 0.0000
Mean :1.552 Mean :35.49 Mean :-0.0167 Mean :-0.1338
3rd Qu.:2.000 3rd Qu.:41.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. :3.000 Max. :79.00 Max. : 8.0000 Max. : 8.0000
PAY_3 PAY_4 PAY_5 PAY_6
Min. :-2.0000 Min. :-2.0000 Min. :-2.0000 Min. :-2.0000
1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 0.0000
Mean :-0.1662 Mean :-0.2207 Mean :-0.2662 Mean :-0.2911
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. : 8.0000 Max. : 8.0000 Max. : 8.0000 Max. : 8.0000
BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4
Min. :-165580 Min. :-69777 Min. :-157264 Min. :-170000
1st Qu.: 3559 1st Qu.: 2985 1st Qu.: 2666 1st Qu.: 2327
Median : 22382 Median : 21200 Median : 20089 Median : 19052
Mean : 51223 Mean : 49179 Mean : 47013 Mean : 43263
3rd Qu.: 67091 3rd Qu.: 64006 3rd Qu.: 60165 3rd Qu.: 54506
Max. : 964511 Max. :983931 Max. :1664089 Max. : 891586
BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
Min. :-81334 Min. :-339603 Min. : 0 Min. : 0
1st Qu.: 1763 1st Qu.: 1256 1st Qu.: 1000 1st Qu.: 833
Median : 18105 Median : 17071 Median : 2100 Median : 2009
Mean : 40311 Mean : 38872 Mean : 5664 Mean : 5921
3rd Qu.: 50191 3rd Qu.: 49198 3rd Qu.: 5006 3rd Qu.: 5000
Max. :927171 Max. : 961664 Max. :873552 Max. :1684259
PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
1st Qu.: 390 1st Qu.: 296 1st Qu.: 252.5 1st Qu.: 117.8
Median : 1800 Median : 1500 Median : 1500.0 Median : 1500.0
Mean : 5226 Mean : 4826 Mean : 4799.4 Mean : 5215.5
3rd Qu.: 4505 3rd Qu.: 4013 3rd Qu.: 4031.5 3rd Qu.: 4000.0
Max. :896040 Max. :621000 Max. :426529.0 Max. :528666.0
pnm
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.2212
3rd Qu.:0.0000
Max. :1.0000
View function is used to view our dataset.
We convert pnm, EDUCATION and SEX to factors so that the plots and the tree treat them as categorical variables.
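The conversion code itself is not shown, so the following is only a sketch of one way to do it. The NA's: 14 entry for EDUCATION in the summary suggests that the levels were limited to 1 to 6, which would turn the code 0 into NA; that is an assumption made here, and the toy values are illustrative.

```r
# Toy columns using the codes seen in the summary output (illustrative values only)
def_1 <- data.frame(
  SEX       = c(2, 2, 1, 1, 2),
  EDUCATION = c(2, 1, 3, 0, 6),
  pnm       = c(1, 0, 0, 0, 1)
)

# Convert the coded columns to factors so they are treated as categories
def_1$SEX <- as.factor(def_1$SEX)
def_1$pnm <- as.factor(def_1$pnm)

# Assumption: restricting the levels to 1..6 turns the code 0 into NA
def_1$EDUCATION <- factor(def_1$EDUCATION, levels = 1:6)

summary(def_1)
```

If the 0 codes should be kept as their own category, as.factor(def_1$EDUCATION) would keep them instead of producing NA.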
summary(def_1)
       ID          LIMIT_BAL       SEX       EDUCATION       MARRIAGE
 Min.   :    1   Min.   :  10000   1:11888   1   :10585   Min.   :0.000
 1st Qu.: 7501   1st Qu.:  50000   2:18112   2   :14030   1st Qu.:1.000
 Median :15000   Median : 140000             3   : 4917   Median :2.000
 Mean   :15000   Mean   : 167484             4   :  123   Mean   :1.552
 3rd Qu.:22500   3rd Qu.: 240000             5   :  280   3rd Qu.:2.000
 Max.   :30000   Max.   :1000000             6   :   51   Max.   :3.000
                                             NA's:   14
AGE PAY_0 PAY_2 PAY_3
Min. :21.00 Min. :-2.0000 Min. :-2.0000 Min. :-2.0000
1st Qu.:28.00 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
Median :34.00 Median : 0.0000 Median : 0.0000 Median : 0.0000
Mean :35.49 Mean :-0.0167 Mean :-0.1338 Mean :-0.1662
3rd Qu.:41.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. :79.00 Max. : 8.0000 Max. : 8.0000 Max. : 8.0000
PAY_4 PAY_5 PAY_6 BILL_AMT1
Min. :-2.0000 Min. :-2.0000 Min. :-2.0000 Min. :-165580
1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.: 3559
Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 22382
Mean :-0.2207 Mean :-0.2662 Mean :-0.2911 Mean : 51223
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 67091
Max. : 8.0000 Max. : 8.0000 Max. : 8.0000 Max. : 964511
BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5
Min. :-69777 Min. :-157264 Min. :-170000 Min. :-81334
1st Qu.: 2985 1st Qu.: 2666 1st Qu.: 2327 1st Qu.: 1763
Median : 21200 Median : 20089 Median : 19052 Median : 18105
Mean : 49179 Mean : 47013 Mean : 43263 Mean : 40311
3rd Qu.: 64006 3rd Qu.: 60165 3rd Qu.: 54506 3rd Qu.: 50191
Max. :983931 Max. :1664089 Max. : 891586 Max. :927171
BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3
Min. :-339603 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 1256 1st Qu.: 1000 1st Qu.: 833 1st Qu.: 390
Median : 17071 Median : 2100 Median : 2009 Median : 1800
Mean : 38872 Mean : 5664 Mean : 5921 Mean : 5226
3rd Qu.: 49198 3rd Qu.: 5006 3rd Qu.: 5000 3rd Qu.: 4505
Max. : 961664 Max. :873552 Max. :1684259 Max. :896040
    PAY_AMT4         PAY_AMT5           PAY_AMT6        pnm
 Min.   :     0   Min.   :     0.0   Min.   :     0.0   0:23364
 1st Qu.:   296   1st Qu.:   252.5   1st Qu.:   117.8   1: 6636
 Median :  1500   Median :  1500.0   Median :  1500.0
 Mean   :  4826   Mean   :  4799.4   Mean   :  5215.5
 3rd Qu.:  4013   3rd Qu.:  4031.5   3rd Qu.:  4000.0
 Max.   :621000   Max.   :426529.0   Max.   :528666.0
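The caTools library is used to split the data randomly into train and test sets, and the result is assigned to sample_def. SplitRatio is set to 0.8 because we want 80% of our original dataset as train data and 20% as test data. Note that because the whole data frame is passed to sample.split, the result has one logical value per column (25 values) rather than one per row, and subsetting then recycles that 25-value pattern across the rows. Passing the label column (def_1$pnm) instead gives one value per row.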
sample_def=sample.split(def_1,SplitRatio = 0.8)
sample_def
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[13] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[25] FALSE
The train data (80% of the data) is created as train_def. It is the subset of our original data for which sample_def is TRUE.
The test data (20% of the data) is created as test_def. It is the subset of our original data for which sample_def is FALSE, that is, the rows not used for training.
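A row-wise 80/20 split and the two subsets can be sketched as follows. To keep the snippet free of extra packages it uses base R sample() rather than caTools, which is a substitution and not the original code; the toy data and the seed are arbitrary.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Toy data with 100 rows (illustrative, not the real dataset)
def_1 <- data.frame(
  LIMIT_BAL = seq(10000, 1000000, length.out = 100),
  pnm       = factor(rep(c(0, 0, 0, 0, 1), times = 20))
)

# One TRUE/FALSE per row: 80% of the rows flagged TRUE (train), the rest FALSE (test)
n          <- nrow(def_1)
sample_def <- seq_len(n) %in% sample(n, size = round(0.8 * n))
# With caTools, a per-row split would be: sample.split(def_1$pnm, SplitRatio = 0.8)

train_def <- subset(def_1, sample_def == TRUE)
test_def  <- subset(def_1, sample_def == FALSE)

c(train = nrow(train_def), test = nrow(test_def))
```

The key point is that sample_def has the same length as the number of rows, so every row is assigned to exactly one of the two sets.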
The rpart library is used to build the model that is trained for our prediction. We specify minsplit (the minimum number of observations a node must contain before a split is attempted), minbucket (the minimum number of observations in any terminal node), maxdepth (the length of the longest path from the tree root to a leaf) and cp (the complexity parameter, which controls the size of the decision tree and is used to select the optimal tree size). method = "class" is used because our dependent variable is dichotomous (for example, 0 or 1).
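The model call is not shown in the text, so the following is a minimal sketch of how these settings are passed to rpart. The control values other than cp = 0 are placeholders rather than the settings actually used, and the training data is synthetic.

```r
library(rpart)

set.seed(1)  # arbitrary seed for the toy data

# Synthetic training data (not the real dataset)
train_def <- data.frame(
  PAY_0     = sample(-2:8, 500, replace = TRUE),
  LIMIT_BAL = sample(seq(10000, 500000, by = 10000), 500, replace = TRUE)
)
# Synthetic label: a payment delay of 2 or more months counts as default
train_def$pnm <- factor(ifelse(train_def$PAY_0 >= 2, 1, 0))

# Placeholder control values; the original settings are not given in the text
def_model <- rpart(
  pnm ~ .,
  data    = train_def,
  method  = "class",
  control = rpart.control(minsplit = 20, minbucket = 7, maxdepth = 10, cp = 0)
)

print(def_model)
```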
The complexity parameter is checked for our model. In the xerror column the error values fall and then rise slightly after a particular row. The cp value at that row, just before the error starts to increase, is normally taken as our cp value. The graph shows the same change in the error values. In our model, however, we have set cp to 0 in order to get a tree with more splits that considers many variables.
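Reading the cp table and picking the value with the lowest cross-validated error can be sketched as below. The data is synthetic with some label noise, and best_cp and pruned_model are names introduced only for this illustration.

```r
library(rpart)

set.seed(1)  # arbitrary seed for the toy data and the cross-validation

# Synthetic training data with about 10% label noise (not the real dataset)
n <- 500
train_def <- data.frame(
  PAY_0 = sample(-2:8, n, replace = TRUE),
  AGE   = sample(21:79, n, replace = TRUE)
)
label <- ifelse(train_def$PAY_0 >= 2, 1, 0)
flip  <- runif(n) < 0.1
train_def$pnm <- factor(ifelse(flip, 1 - label, label))

# Grow a large tree with cp = 0, as in the text
def_model <- rpart(pnm ~ ., data = train_def, method = "class",
                   control = rpart.control(cp = 0))

printcp(def_model)    # columns: CP, nsplit, rel error, xerror, xstd
# plotcp(def_model)   # the same information as a graph

# Pick the cp with the lowest cross-validated error and prune to it
cp_table     <- def_model$cptable
best_cp      <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned_model <- prune(def_model, cp = best_cp)
```

Keeping cp = 0 gives the larger tree described in the text, while pruning to best_cp trades some of that detail for a tree that is less likely to overfit.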
Now we plot our decision tree model. The resulting graph shows the splits of the tree.
def_predict holds our predictions, which are made with our model on the test data. type = "class" is used since our dependent variable is dichotomous.
def_predict=predict(def_model,test_def,type="class")
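The table compares the actual values with our predicted values (a confusion matrix), and the accuracy is calculated from it. Here we get 82.4% accuracy for our prediction. Above 80% accuracy is considered a good prediction, although since 77.9% of the customers (23364 of 30000) belong to class 0, always predicting 0 would already score about 78%.
def_accu is the variable that stores the calculated accuracy.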
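The accuracy calculation can be sketched as the share of predictions on the diagonal of the table. The actual and predicted vectors below are toy values, and def_table is a name introduced for illustration; with the real data the table would be built from test_def$pnm and def_predict.

```r
# Toy actual and predicted labels (illustrative only)
actual    <- factor(c(0, 0, 0, 1, 1, 0, 1, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(0, 0, 1, 1, 0, 0, 1, 0, 0, 0), levels = c(0, 1))

# Rows are actual classes, columns are predicted classes
def_table <- table(actual, predicted)

# Accuracy = correct predictions (the diagonal) / all predictions
def_accu <- sum(diag(def_table)) / sum(def_table)
print(paste("Accuracy is", def_accu))
```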
print(paste("Accuracy is",def_accu))
[1] "Accuracy is 0.824166666666667"