Original Dataset

credit_data_track2_part_A.csv

Descriptive Statistics

This section gives a summary of the dataset, including its size, the classes of its columns, and the distribution of its values.

Load Data

credit <- read.csv("E:/R/TDtest/credit_data_track2_part_A.CSV", stringsAsFactors=FALSE, header=TRUE);
library(DT);
datatable(credit, options = list(pageLength = 5),fillContainer=T)

Numbers of Rows and Columns

Check size:

dim(credit);
## [1] 1003   22

This dataset contains 1003 rows and 22 columns.

Names and Classes of Columns

Check column names:

colnames(credit);
##  [1] "checking_status"        "duration"              
##  [3] "credit_history"         "purpose"               
##  [5] "credit_amount"          "savings_status"        
##  [7] "employment"             "installment_commitment"
##  [9] "personal_status"        "other_parties"         
## [11] "residence_since"        "property_magnitude"    
## [13] "age"                    "asnm"                  
## [15] "other_payment_plans"    "housing"               
## [17] "existing_credits"       "job"                   
## [19] "num_dependents"         "own_telephone"         
## [21] "foreign_worker"         "accepted"

All columns have been named semantically.

Explore column classes:

data.frame(lapply(credit[,], class));
##   checking_status duration credit_history   purpose credit_amount
## 1       character  integer      character character       integer
##   savings_status employment installment_commitment personal_status
## 1      character  character                integer       character
##   other_parties residence_since property_magnitude     age    asnm
## 1     character         integer          character integer integer
##   other_payment_plans   housing existing_credits       job num_dependents
## 1           character character          integer character        integer
##   own_telephone foreign_worker accepted
## 1     character      character  integer

Or, check the class of a single column by name:

class(credit$checking_status);
## [1] "character"
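
Alternatively, base R's str() reports the class of every column, together with a preview of its values, in a single call:

str(credit);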

Distributions of Values

For character-class columns, find the unique values and their frequencies:

library(plyr);
library(dplyr);
count(credit, credit_history);
## # A tibble: 5 × 2
##                         credit_history     n
##                                  <chr> <int>
## 1                All_credits_paid_duly    49
## 2 Critical_acct_other_credits_existing   294
## 3                        Delay_in_past    88
## 4       Existing_credits_paid_till_now   532
## 5         No_credits_taken_or_all_paid    40

There are five unique values for credit_history. The value Existing_credits_paid_till_now is the most frequent, with 532 rows, while the value No_credits_taken_or_all_paid is the least frequent, with 40 rows.
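
The same counts can also be expressed as proportions of the whole dataset; a small sketch using base R:

# Relative frequency of each credit_history value
prop.table(table(credit$credit_history));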

For integer-class columns, find the min, max, quartiles and mean:

summary(credit$duration);
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   12.00   18.00   20.89   24.00   72.00
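
summary() does not report the spread around the mean; as a small addition, the standard deviation and interquartile range can be computed directly:

# Dispersion measures for duration (na.rm guards against possible NAs)
sd(credit$duration, na.rm=TRUE);
IQR(credit$duration, na.rm=TRUE);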

Handling Missing Values

Check whether missing values (NA) exist in each column, and find their indices if any:

which(is.na(credit$purpose));
## integer(0)

No missing values in the purpose column.

which(is.na(credit$credit_amount));
## [1] 11

There is one missing value in the credit_amount column, located in row 11.

Missing values in numeric columns can be substituted with the mean or median of the corresponding column, while missing values in character columns can be replaced with the most frequent value.

Take the credit_amount column for example:

pos.NA=which(is.na(credit$credit_amount));
credit$credit_amount[pos.NA]=mean(credit$credit_amount, na.rm=T);
credit$credit_amount[pos.NA];
## [1] 1422078
which(is.na(credit$credit_amount));
## integer(0)

The NA in row 11 has been replaced by the value 1422078.
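
For a character column, the most-frequent-value rule mentioned above could be implemented as in the following sketch (purely illustrative here, since the purpose column contains no NAs):

# Most frequent value (mode) of the column, ignoring NAs
value.Mode=names(which.max(table(credit$purpose, useNA='no')));
pos.NA.CHA=which(is.na(credit$purpose));
credit$purpose[pos.NA.CHA]=value.Mode;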

If there are multiple NAs in one column, the previous R code still works as a batch operation, since pos.NA would be a vector. Alternatively, in this situation, a more sophisticated option is to impute with Gaussian random numbers:

pos.NA=which(is.na(credit$age));
pos.NA;
##  [1]   1  34  41  65  72  91 102 114 132 139 180 189 253 269 280 337 340
## [18] 343 369 382 413 419 429 480 482 502 567 570 583 587 635 638 660 675
## [35] 687 760 773 777 798 827 848 856 860 877 878 935 959 962 982 987

There are 50 NAs in the age column.

Num.G=rnorm(n=length(pos.NA), mean=mean(credit$age, na.rm=T), sd=sd(credit$age, na.rm=T));
credit$age[pos.NA]=Num.G;
which(is.na(credit$age));
## integer(0)

The imputed values approximately preserve the mean and standard deviation of the original column.
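
A quick sanity check is to recompute the mean and standard deviation of age after imputation; they should be close to the pre-imputation values passed to rnorm() (output omitted here):

# Post-imputation statistics, to be compared with the arguments of rnorm() above
mean(credit$age);
sd(credit$age);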

Another popular and low-effort way of dealing with missing values is to directly remove the rows (records) containing NAs:

credit <- read.csv("E:/R/TDtest/credit_data_track2_part_A.CSV", stringsAsFactors=FALSE, header=TRUE);
credit.NoNA=na.omit(credit);
dim(credit.NoNA);
## [1] 952  22

The new dataset contains only 952 rows, due to the aforementioned strategy.

This ‘abnormal’ subset, i.e. the rows containing NAs, can be saved for further reference:

credit.NA=setdiff(credit, credit.NoNA);
dim(credit.NA);
## [1] 51 22
write.csv(credit.NA,'E:/R/TDtest/credit_NA.csv', row.names = F)

A dataset with 51 rows has been saved locally.

Visualization

Data visualization can be presented with static figures using the base graphics, ggplot2, or lattice packages. Here, interactive figures are offered using R packages that provide JavaScript interfaces.

For Distribution

Explore the distribution of values in each column. Use a bar chart for the character-class columns:

library(htmlwidgets);
library(highcharter);
data.CHA=select(credit.NoNA, which(lapply(credit.NoNA[,],class)=='character'))
value.Frequency.CHA=NULL;
for ( i in 1: ncol(data.CHA)) {
  value.Frequency.CHA=rbind(value.Frequency.CHA, cbind(colnames(data.CHA)[i],count(data.CHA, data.CHA[,i])))
};
colnames(value.Frequency.CHA)=c('Feature','Value','Frequency');
hchart(value.Frequency.CHA, 'column', x=Value, y=Frequency, group=Feature) %>%
  hc_title(text='Value Distribution for All Character-Class columns');

Use boxplots for the non-character-class (integer-class) columns:

data.INT=select(credit.NoNA, -which(lapply(credit.NoNA[,],class)=='character'));
hc=highchart();
for (i in 1:ncol(data.INT)){hc=hc %>% hc_add_series_boxplot(x = data.INT[,i],name = colnames(data.INT)[i], outlier=F)};
hc %>%
  hc_title(text='Boxplot for All Integer-Class Columns');

Or we can plot a histogram for a specific column:

hchart(credit$duration, type='column', name = "duration") %>%
  hc_title(text='Histogram for Duration') %>%
  hc_subtitle(text='Drag and Zoom in for details')

For Relevance

We can simply explore the correlation between any two numeric columns, grouped by the value of accepted:

library(pairsD3);
pairsD3(data.INT[,1:(ncol(data.INT)-1)],group=data.INT[,ncol(data.INT)])

As can be seen, the features are not highly correlated, except for age and asnm. Also, the points with different accepted values are not well separated.
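
The visual impression can be checked numerically with a correlation matrix over the integer-class columns (a small sketch; cor() is base R):

# Pairwise Pearson correlations, rounded for readability
round(cor(data.INT), 2);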

Prediction

Prediction tasks usually include regression, for a numeric dependent variable, and classification, for a categorical dependent variable. For illustrative purposes, the classification process will be explained.

I will compare four popular classification algorithms to predict the value of the last column, namely accepted.

K-Nearest Neighbour Method

The KNN method is the most intuitive way to predict the output for new inputs: it relies on the outcomes of similar cases. The key effort is to define the distance between two numeric entries. Non-numeric entries cannot be handled unless extra distances over categorical values are defined.
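
For intuition, the Euclidean distance that KNN relies on can be written directly for two numeric rows (a minimal sketch; knn() below computes this internally):

# Euclidean distance between two numeric feature vectors
dist.Euclidean=function(a, b) sqrt(sum((a-b)^2));
dist.Euclidean(data.INT[1, -ncol(data.INT)], data.INT[2, -ncol(data.INT)]);

Note that columns with large scales (e.g. credit_amount) dominate such a distance unless the features are standardized first.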

Use data.INT, with the first 200 rows as the test set and the remaining rows as the training set:

library(class);
pred.KNN=knn(train=data.INT[-(1:200),-ncol(data.INT)], test=data.INT[1:200,-ncol(data.INT)], cl=data.INT$accepted[-(1:200)], k=10, l=1);
confusionM.KNN=table(pred.KNN, data.INT[1:200,ncol(data.INT)]);
confusionM.KNN;
##         
## pred.KNN   0   1
##        0 122  43
##        1  23  12
prec=confusionM.KNN[1,1]/(confusionM.KNN[1,1]+confusionM.KNN[1,2]);
reca=confusionM.KNN[1,1]/(confusionM.KNN[1,1]+confusionM.KNN[2,1]);
cat('Precision, Recall and F-score of KNN: ', prec, reca, 2*prec*reca/(prec+reca));
## Precision, Recall and F-score of KNN:  0.7393939 0.8413793 0.7870968
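
Since the same precision/recall/F-score computation recurs for every model below, it could be wrapped in a small helper (a sketch only; the report keeps the explicit form for each model):

# Precision, recall and F-score from a 2x2 confusion matrix,
# taking the first row/column as the positive class
scores=function(cm) {
  prec=cm[1,1]/(cm[1,1]+cm[1,2]);
  reca=cm[1,1]/(cm[1,1]+cm[2,1]);
  c(Precision=prec, Recall=reca, F.score=2*prec*reca/(prec+reca))
};
scores(confusionM.KNN);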

Decision Tree Method

The decision tree method employs entropy and information gain to determine which column (feature) to split on. It accepts both continuous and categorical features. We can again use the dataset credit.NoNA, with the first 200 rows as the test set.
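
For intuition, the entropy of the accepted column and the information gain of one candidate split can be computed by hand (a small sketch, independent of the rpart workflow below; note that rpart itself splits on the Gini index by default):

# Entropy (in bits) of a categorical vector
entropy=function(x) {
  p=prop.table(table(x));
  -sum(p*log2(p))
};
# Information gain of splitting accepted by checking_status
H.parent=entropy(credit.NoNA$accepted);
H.children=sapply(split(credit.NoNA$accepted, credit.NoNA$checking_status), entropy);
w=prop.table(table(credit.NoNA$checking_status));
H.parent - sum(w*H.children);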

library(rpart);
Index=which(lapply(credit.NoNA[,],class)=='character');
for ( i in 1 :length(Index)) {
  credit.NoNA[,Index[i]]=as.factor(credit.NoNA[,Index[i]])
};
trainSet=credit.NoNA[-(1:200),];
testSet=credit.NoNA[1:200, -ncol(credit.NoNA)];
config.DT <- rpart.control(xval=10) 
model.DT=rpart(as.factor(accepted)~., data=trainSet, control=config.DT);
pred.DT=predict(model.DT, testSet, type='class');
confusionM.DT=table(pred.DT, credit.NoNA[1:200, ncol(credit.NoNA)]);
confusionM.DT;
##        
## pred.DT   0   1
##       0 129  34
##       1  16  21
prec=confusionM.DT[1,1]/(confusionM.DT[1,1]+confusionM.DT[1,2]);
reca= confusionM.DT[1,1]/(confusionM.DT[1,1]+confusionM.DT[2,1]);
cat('Precision, Recall and F-score of DT: ', prec, reca, 2*prec*reca/(prec+reca));
## Precision, Recall and F-score of DT:  0.791411 0.8896552 0.8376623

The complexity information can be obtained:

printcp(model.DT);
## 
## Classification tree:
## rpart(formula = as.factor(accepted) ~ ., data = trainSet, control = config.DT)
## 
## Variables actually used in tree construction:
## [1] age             checking_status credit_amount   credit_history 
## [5] duration        employment      purpose         savings_status 
## 
## Root node error: 225/752 = 0.2992
## 
## n= 752 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.046667      0   1.00000 1.00000 0.055809
## 2 0.031111      4   0.78667 0.90222 0.054106
## 3 0.020741      5   0.75556 0.88444 0.053765
## 4 0.013333      8   0.69333 0.88444 0.053765
## 5 0.010000     11   0.65333 0.90222 0.054106

This tree actually involves only eight features. The visualized tree is as follows:

library(rpart.plot);
rpart.plot(model.DT, type=4, tweak = 2.5)

The tree should be pruned to guard against overfitting.

model.DT.Pruned=prune(model.DT, cp=model.DT$cptable[which.min(model.DT$cptable[,'xerror']),'CP']);
printcp(model.DT.Pruned);
## 
## Classification tree:
## rpart(formula = as.factor(accepted) ~ ., data = trainSet, control = config.DT)
## 
## Variables actually used in tree construction:
## [1] checking_status credit_history  duration        purpose        
## [5] savings_status 
## 
## Root node error: 225/752 = 0.2992
## 
## n= 752 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.046667      0   1.00000 1.00000 0.055809
## 2 0.031111      4   0.78667 0.90222 0.054106
## 3 0.020741      5   0.75556 0.88444 0.053765
rpart.plot(model.DT.Pruned, type=4, tweak=2.5);

Let’s go back and check the accuracy of the pruned tree:

pred.DT.Pruned=predict(model.DT.Pruned, testSet, type='class');
confusionM.DT.Pruned=table(pred.DT.Pruned, credit.NoNA[1:200, ncol(credit.NoNA)]);
confusionM.DT.Pruned;
##               
## pred.DT.Pruned   0   1
##              0 130  29
##              1  15  26
prec= confusionM.DT.Pruned[1,1]/(confusionM.DT.Pruned[1,1]+confusionM.DT.Pruned[1,2]);
reca=confusionM.DT.Pruned[1,1]/(confusionM.DT.Pruned[1,1]+confusionM.DT.Pruned[2,1]);
cat('Precision, Recall and F-score of DT.Pruned: ',prec, reca, 2*prec*reca/(prec+reca));
## Precision, Recall and F-score of DT.Pruned:  0.8176101 0.8965517 0.8552632

Random Forest Method

The random forest method is an efficient way of improving accuracy and suppressing overfitting.

library(randomForest);
model.RF=randomForest(as.factor(accepted)~., data=trainSet, ntree=500, mtry=floor(sqrt(ncol(trainSet))));
pred.RF=predict(model.RF, testSet, type='class');
confusionM.RF=table(pred.RF, credit.NoNA[1:200, ncol(credit.NoNA)]);
confusionM.RF;
##        
## pred.RF   0   1
##       0 131  36
##       1  14  19
prec=confusionM.RF[1,1]/(confusionM.RF[1,1]+confusionM.RF[1,2]);
reca=confusionM.RF[1,1]/(confusionM.RF[1,1]+confusionM.RF[2,1]);
cat('Precision, Recall and F-score of RF: ', prec, reca, 2*prec*reca/(prec+reca));
## Precision, Recall and F-score of RF:  0.7844311 0.9034483 0.8397436
plot(model.RF);

The errors have converged by the time 500 trees have been generated.
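
The fitted forest also exposes variable importance, which indicates which features drive the predictions (a short sketch using randomForest's built-in accessors):

# Mean decrease in Gini impurity per feature, and the corresponding plot
importance(model.RF);
varImpPlot(model.RF);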

SVM Method

library(e1071);
model.SVM=svm(as.factor(accepted)~., data=trainSet, kernel = "radial", type='C-classification');
pred.SVM=predict(model.SVM, testSet, type='class');
confusionM.SVM=table(pred.SVM, credit.NoNA[1:200, ncol(credit.NoNA)]);
confusionM.SVM;
##         
## pred.SVM   0   1
##        0 142  43
##        1   3  12
prec= confusionM.SVM[1,1]/(confusionM.SVM[1,1]+confusionM.SVM[1,2]);
reca= confusionM.SVM[1,1]/(confusionM.SVM[1,1]+confusionM.SVM[2,1]);
cat('Precision, Recall and F-score of SVM: ', prec, reca, 2*prec*reca/(prec+reca));
## Precision, Recall and F-score of SVM:  0.7675676 0.9793103 0.8606061
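
The SVM above uses the radial kernel with e1071's default cost and gamma; these could in principle be tuned by grid search with e1071's tune() (a sketch only, not run for this report; the candidate grids are arbitrary choices):

# Grid search over cost and gamma with built-in cross-validation
tuned.SVM=tune(svm, as.factor(accepted)~., data=trainSet,
               kernel='radial', type='C-classification',
               ranges=list(cost=c(0.1, 1, 10), gamma=c(0.01, 0.05, 0.1)));
summary(tuned.SVM);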

Optimization

Random forest demonstrates the capability of balancing precision and recall; hence, we choose the RF model to optimize. A common strategy for improving model performance is to remove anomalies (outliers) before training.

The following gives the process of RF optimization after removing such anomalies.

row.Outlier=NULL;
for (i in 1 : ncol(data.INT)){
  new.Rows=which(is.element(data.INT[,i], boxplot.stats(data.INT[,i])$out)==T);
  row.Outlier=union(row.Outlier, new.Rows)
};
credit.Outlier.Removed=credit[-row.Outlier,];
credit.OutlierNA.Removed=na.omit(credit.Outlier.Removed);

A new dataset has been generated after outlier removal.

library(randomForest);
Index=which(lapply(credit.OutlierNA.Removed[,],class)=='character');
for ( i in 1 :length(Index)) {
  credit.OutlierNA.Removed[,Index[i]]=as.factor(credit.OutlierNA.Removed[,Index[i]])
};
trainSet.New=credit.OutlierNA.Removed[-(1:140), ];
testSet.New=credit.OutlierNA.Removed[1:140, -ncol(credit.OutlierNA.Removed)];
model.RF.New=randomForest(as.factor(accepted)~., data=trainSet.New, ntree=500, mtry=floor(sqrt(ncol(trainSet.New))));
pred.RF.New=predict(model.RF.New, testSet.New, type='class');
confusionM.RF.New=table(pred.RF.New, credit.OutlierNA.Removed[1:140, ncol(credit.OutlierNA.Removed)]);
confusionM.RF.New;
##            
## pred.RF.New   0   1
##           0 100  24
##           1   4  12
prec=confusionM.RF.New[1,1]/(confusionM.RF.New[1,1]+confusionM.RF.New[1,2]);
reca=confusionM.RF.New[1,1]/(confusionM.RF.New[1,1]+confusionM.RF.New[2,1]);
cat('Precision, Recall and F-score of RF optimized: ', prec, reca, 2*prec*reca/(prec+reca));
## Precision, Recall and F-score of RF optimized:  0.8064516 0.9615385 0.877193

Comparing ROC curves, the random forest models (the optimized one in red and the original in blue) appear to be the best prediction models under the current setting:

library(ROCR);
plot(performance(prediction(as.integer(pred.RF.New), credit.OutlierNA.Removed[1:140, ncol(credit.OutlierNA.Removed)]),'tpr','fpr'),col='red');
par(new=TRUE);
plot(performance(prediction(as.integer(pred.RF), credit.NoNA[1:200, ncol(credit.NoNA)]),'tpr','fpr'),col='blue');
par(new=TRUE);
plot(performance(prediction(as.integer(pred.SVM), credit.NoNA[1:200, ncol(credit.NoNA)]),'tpr','fpr'),col='green');
par(new=TRUE);
plot(performance(prediction(as.integer(pred.DT), credit.NoNA[1:200, ncol(credit.NoNA)]),'tpr','fpr'),col='yellow');
par(new=TRUE);
plot(performance(prediction(as.integer(pred.KNN), data.INT[1:200,ncol(data.INT)]),'tpr','fpr'),col='purple');

Finally, it is concluded that the overall performance ranking is:

RF.New > RF, DT > SVM > KNN