Original Dataset
credit_data_track2_part_A.csv
This section gives a summary of the dataset, including its size, the classes of its columns, and distribution information.
Load Data
credit <- read.csv("E:/R/TDtest/credit_data_track2_part_A.CSV", stringsAsFactors=FALSE, header=TRUE);
library(DT);
datatable(credit, options = list(pageLength = 5),fillContainer=T)
Numbers of Rows and Columns
Check size:
dim(credit);
## [1] 1003 22
This dataset contains 1003 rows and 22 columns.
Names and Classes of Columns
Check column names:
colnames(credit);
## [1] "checking_status" "duration"
## [3] "credit_history" "purpose"
## [5] "credit_amount" "savings_status"
## [7] "employment" "installment_commitment"
## [9] "personal_status" "other_parties"
## [11] "residence_since" "property_magnitude"
## [13] "age" "asnm"
## [15] "other_payment_plans" "housing"
## [17] "existing_credits" "job"
## [19] "num_dependents" "own_telephone"
## [21] "foreign_worker" "accepted"
All columns have been named semantically.
Explore column classes:
data.frame(lapply(credit[,], class));
## checking_status duration credit_history purpose credit_amount
## 1 character integer character character integer
## savings_status employment installment_commitment personal_status
## 1 character character integer character
## other_parties residence_since property_magnitude age asnm
## 1 character integer character integer integer
## other_payment_plans housing existing_credits job num_dependents
## 1 character character integer character integer
## own_telephone foreign_worker accepted
## 1 character character integer
Or, try with one arbitrary column name:
class(credit$checking_status);
## [1] "character"
Distributions of Values
For character-class columns, find the unique values and their frequencies:
library(plyr);
library(dplyr);
count(credit, credit_history);
## # A tibble: 5 × 2
## credit_history n
## <chr> <int>
## 1 All_credits_paid_duly 49
## 2 Critical_acct_other_credits_existing 294
## 3 Delay_in_past 88
## 4 Existing_credits_paid_till_now 532
## 5 No_credits_taken_or_all_paid 40
There are five unique values for credit_history. Existing_credits_paid_till_now is the most frequent (532 rows), while No_credits_taken_or_all_paid is the least frequent (40 rows).
For integer-class columns, find the minimum, maximum, quartiles and mean:
summary(credit$duration);
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 12.00 18.00 20.89 24.00 72.00
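All integer-class columns can also be summarised in one call, if that is more convenient (a small sketch; columns that still contain NAs will additionally report an NA count):
# Five-number summary plus mean for every numeric column in the data set
summary(credit[sapply(credit, is.numeric)])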
Handling Missing Values
Check whether there are missing values (NA) in each column, and find their indices if any:
which(is.na(credit$purpose));
## integer(0)
No missing values in the purpose column.
which(is.na(credit$credit_amount));
## [1] 11
There is one missing value in the credit_amount column, located in row 11.
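Rather than checking columns one by one, the NA count for every column can be obtained in a single call (a small convenience sketch):
# Number of missing values in each column
colSums(is.na(credit))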
Missing values in numeric columns can be substituted with the mean or median of the corresponding column, while missing values in character columns can be replaced with the most frequent value.
Take the credit_amount column for example:
pos.NA=which(is.na(credit$credit_amount));
credit$credit_amount[pos.NA]=mean(credit$credit_amount, na.rm=T);
credit$credit_amount[pos.NA];
## [1] 1422078
which(is.na(credit$credit_amount));
## integer(0)
The NA in row 11 has been replaced by the value 1422078.
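The most-frequent-value strategy for a character column would look like the sketch below (purpose is used purely for illustration here, since it was shown above to contain no NAs):
# Hypothetical example: fill NAs in a character column with its most frequent value
pos.NA=which(is.na(credit$purpose));
most.Frequent=names(which.max(table(credit$purpose)));
credit$purpose[pos.NA]=most.Frequent;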
If a column contains multiple NAs, the mean-imputation code above still handles them in a single batch, since pos.NA would then be a vector. Alternatively, a more sophisticated approach is to use Gaussian random numbers:
pos.NA=which(is.na(credit$age));
pos.NA;
## [1] 1 34 41 65 72 91 102 114 132 139 180 189 253 269 280 337 340
## [18] 343 369 382 413 419 429 480 482 502 567 570 583 587 635 638 660 675
## [35] 687 760 773 777 798 827 848 856 860 877 878 935 959 962 982 987
There are 50 NAs in the age column.
Num.G=rnorm(n=length(pos.NA), mean=mean(credit$age, na.rm=T), sd=sd(credit$age, na.rm=T));
credit$age[pos.NA]=Num.G;
which(is.na(credit$age));
## integer(0)
The imputed values approximately preserve the mean and standard deviation of the original column.
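This can be verified by recording the column's statistics before imputation and comparing them afterwards (a minimal sketch; the 'before' values would need to be stored prior to the assignment above):
# Hypothetical check around the Gaussian imputation step
before=c(mean=mean(credit$age, na.rm=T), sd=sd(credit$age, na.rm=T));
# ... run the imputation shown above ...
after=c(mean=mean(credit$age), sd=sd(credit$age));
round(rbind(before, after), 2)  # the two rows should be close, but not identical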
Another popular and low-effort way of dealing with missing values is to simply remove the rows (records) that contain NAs:
credit <- read.csv("E:/R/TDtest/credit_data_track2_part_A.CSV", stringsAsFactors=FALSE, header=TRUE);
credit.NoNA=na.omit(credit);
dim(credit.NoNA);
## [1] 952 22
The new dataset contains only 952 rows, due to the aforementioned strategy.
This ‘abnormal’ subset, which contains the NAs, can be saved for further reference:
credit.NA=setdiff(credit, credit.NoNA);
dim(credit.NA);
## [1] 51 22
write.csv(credit.NA,'E:/R/TDtest/credit_NA.csv', row.names = F)
The dataset with 51 rows has been saved locally.
Data visualization can be presented as static figures using the base graphics, ggplot2 or lattice packages. Here, interactive figures are offered instead, using R packages that provide JavaScript interfaces.
For Distribution
Explore the distribution of values in each column. Use a bar chart for the character-class columns:
library(htmlwidgets);
library(highcharter);
data.CHA=select(credit.NoNA, which(lapply(credit.NoNA[,],class)=='character'))
value.Frequency.CHA=NULL;
for ( i in 1: ncol(data.CHA)) {
value.Frequency.CHA=rbind(value.Frequency.CHA, cbind(colnames(data.CHA)[i],count(data.CHA, data.CHA[,i])))
};
colnames(value.Frequency.CHA)=c('Feature','Value','Frequency');
hchart(value.Frequency.CHA, 'column', x=Value, y=Frequency, group=Feature) %>%
hc_title(text='Value Distribution for All Character-Class columns');
Use a boxplot for the non-character-class (integer-class) columns:
data.INT=select(credit.NoNA, -which(lapply(credit.NoNA[,],class)=='character'));
hc=highchart();
for (i in 1:ncol(data.INT)){hc=hc %>% hc_add_series_boxplot(x = data.INT[,i],name = colnames(data.INT)[i], outlier=F)};
hc %>%
hc_title(text='Boxplot for All Integer-Class Columns');
Or we can plot a histogram for a specific column:
hchart(credit$duration, type='column', name = "duration") %>%
hc_title(text='Histogram for Duration') %>%
hc_subtitle(text='Drag and Zoom in for details')
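For comparison, the same distribution can be drawn as a static base-graphics figure in one line (a small sketch):
# Static equivalent of the interactive histogram above
hist(credit$duration, main='Histogram for Duration', xlab='duration')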
For Relevance
We can simply explore the correlation between any two numeric columns, grouped by the status of accepted:
library(pairsD3);
pairsD3(data.INT[,1:(ncol(data.INT)-1)],group=data.INT[,ncol(data.INT)])
As can be seen, the features are not highly correlated, except for age and asnm. The points with different accepted values are also not well separated.
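To quantify this impression, the pairwise correlation matrix of the numeric columns can be computed (a quick sketch using the same data.INT):
# Pearson correlations between the integer-class features; the age/asnm pair
# should stand out as the strongest off-diagonal entry
round(cor(data.INT[, -ncol(data.INT)]), 2)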
Prediction tasks usually involve regression for a numeric dependent variable and classification for a categorical dependent variable. For illustrative purposes, the classification process will be explained here.
I will compare four popular classification algorithms to predict the value of the last column, accepted.
K-Nearest Neighbour Method
The KNN method is the most intuitive way to predict the output for new inputs: it relies on the outcomes of similar, previously seen cases. The key effort is to define a distance between two numeric entries. Non-numeric entries cannot be handled unless extra distances for categorical values are defined.
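For instance, the Euclidean distance between the first two numeric rows can be inspected directly (a minimal illustration, not part of the model fit):
# The kind of distance knn() uses when ranking neighbours; note that columns on
# large scales such as credit_amount dominate it unless the features are scaled
dist(data.INT[1:2, -ncol(data.INT)])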
Use data.INT, with the first 200 rows as the test set and the remaining rows as the training set:
library(class);
pred.KNN=knn(train=data.INT[-(1:200),-ncol(data.INT)], test=data.INT[1:200,-ncol(data.INT)], cl=data.INT$accepted[-(1:200)], k=10, l=1);
confusionM.KNN=table(pred.KNN, data.INT[1:200,ncol(data.INT)]);
confusionM.KNN;
##
## pred.KNN 0 1
## 0 122 43
## 1 23 12
prec=confusionM.KNN[1,1]/(confusionM.KNN[1,1]+confusionM.KNN[1,2]);
reca=confusionM.KNN[1,1]/(confusionM.KNN[1,1]+confusionM.KNN[2,1]);
cat('Precision, Recall and F-score of KNN: ', prec, reca, 2*prec*reca/(prec+reca));
## Precision, Recall and F-score of KNN: 0.7393939 0.8413793 0.7870968
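Besides precision and recall for the class of interest, the overall accuracy can be read off the same confusion matrix (a quick sketch):
# Proportion of correctly classified test rows
sum(diag(confusionM.KNN))/sum(confusionM.KNN)
For the matrix above this evaluates to (122 + 12)/200 = 0.67.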
Decision Tree Method
The Decision Tree method employs entropy and information gain to determine which column (feature) will be split. It accepts both continuous and categorical features. We can use the dataset credit.NoNA, again with the first 200 rows as the test set.
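As a small illustration of the impurity measure that drives the splits, the entropy of the target on the training rows can be computed by hand (a sketch; note that rpart itself defaults to the Gini index rather than entropy):
# Shannon entropy of the binary target 'accepted'; a split is worthwhile when
# it reduces this impurity, i.e. yields information gain
p=prop.table(table(credit.NoNA$accepted[-(1:200)]));
-sum(p*log2(p))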
library(rpart);
Index=which(lapply(credit.NoNA[,],class)=='character');
for ( i in 1 :length(Index)) {
credit.NoNA[,Index[i]]=as.factor(credit.NoNA[,Index[i]])
};
trainSet=credit.NoNA[-(1:200),];
testSet=credit.NoNA[1:200, -ncol(credit.NoNA)];
config.DT <- rpart.control(xval=10)
model.DT=rpart(as.factor(accepted)~., data=trainSet, control=config.DT);
pred.DT=predict(model.DT, testSet, type='class');
confusionM.DT=table(pred.DT, credit.NoNA[1:200, ncol(credit.NoNA)]);
confusionM.DT;
##
## pred.DT 0 1
## 0 129 34
## 1 16 21
prec=confusionM.DT[1,1]/(confusionM.DT[1,1]+confusionM.DT[1,2]);
reca= confusionM.DT[1,1]/(confusionM.DT[1,1]+confusionM.DT[2,1]);
cat('Precision, Recall and F-score of DT: ', prec, reca, 2*prec*reca/(prec+reca));
## Precision, Recall and F-score of DT: 0.791411 0.8896552 0.8376623
The complexity information can be obtained:
printcp(model.DT);
##
## Classification tree:
## rpart(formula = as.factor(accepted) ~ ., data = trainSet, control = config.DT)
##
## Variables actually used in tree construction:
## [1] age checking_status credit_amount credit_history
## [5] duration employment purpose savings_status
##
## Root node error: 225/752 = 0.2992
##
## n= 752
##
## CP nsplit rel error xerror xstd
## 1 0.046667 0 1.00000 1.00000 0.055809
## 2 0.031111 4 0.78667 0.90222 0.054106
## 3 0.020741 5 0.75556 0.88444 0.053765
## 4 0.013333 8 0.69333 0.88444 0.053765
## 5 0.010000 11 0.65333 0.90222 0.054106
This tree actually involves only eight features. The visualized tree is as follows:
library(rpart.plot);
rpart.plot(model.DT, type=4, tweak = 2.5)
The tree should be pruned to guard against overfitting.
model.DT.Pruned=prune(model.DT, cp=model.DT$cptable[which.min(model.DT$cptable[,'xerror']),'CP']);
printcp(model.DT.Pruned);
##
## Classification tree:
## rpart(formula = as.factor(accepted) ~ ., data = trainSet, control = config.DT)
##
## Variables actually used in tree construction:
## [1] checking_status credit_history duration purpose
## [5] savings_status
##
## Root node error: 225/752 = 0.2992
##
## n= 752
##
## CP nsplit rel error xerror xstd
## 1 0.046667 0 1.00000 1.00000 0.055809
## 2 0.031111 4 0.78667 0.90222 0.054106
## 3 0.020741 5 0.75556 0.88444 0.053765
rpart.plot(model.DT.Pruned, type=4, tweak=2.5);
Let’s go back and check the accuracy of the pruned tree:
pred.DT.Pruned=predict(model.DT.Pruned, testSet, type='class');
confusionM.DT.Pruned=table(pred.DT.Pruned, credit.NoNA[1:200, ncol(credit.NoNA)]);
confusionM.DT.Pruned;
##
## pred.DT.Pruned 0 1
## 0 130 29
## 1 15 26
prec= confusionM.DT.Pruned[1,1]/(confusionM.DT.Pruned[1,1]+confusionM.DT.Pruned[1,2]);
reca=confusionM.DT.Pruned[1,1]/(confusionM.DT.Pruned[1,1]+confusionM.DT.Pruned[2,1]);
cat('Precision, Recall and F-score of DT.Pruned: ',prec, reca, 2*prec*reca/(prec+reca));
## Precision, Recall and F-score of DT.Pruned: 0.8176101 0.8965517 0.8552632
Random Forest Method
The random forest method is an efficient way of improving accuracy and suppressing overfitting.
library(randomForest);
model.RF=randomForest(as.factor(accepted)~., data=trainSet, ntree=500, mtry=floor(sqrt(ncol(trainSet))));
pred.RF=predict(model.RF, testSet, type='class');
confusionM.RF=table(pred.RF, credit.NoNA[1:200, ncol(credit.NoNA)]);
confusionM.RF;
##
## pred.RF 0 1
## 0 131 36
## 1 14 19
prec=confusionM.RF[1,1]/(confusionM.RF[1,1]+confusionM.RF[1,2]);
reca=confusionM.RF[1,1]/(confusionM.RF[1,1]+confusionM.RF[2,1]);
cat('Precision, Recall and F-score of RF: ', prec, reca, 2*prec*reca/(prec+reca));
## Precision, Recall and F-score of RF: 0.7844311 0.9034483 0.8397436
plot(model.RF);
The errors have converged by the time 500 trees have been generated.
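To see which features the forest relies on most, the variable importance of the fitted model can also be inspected (a quick sketch using model.RF):
# Mean decrease in Gini impurity per feature; larger values indicate features
# the forest depends on more heavily
importance(model.RF);
varImpPlot(model.RF)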
SVM Method
The support vector machine separates the classes with a maximum-margin boundary; a radial kernel is used here so that non-linear boundaries are possible.
library(e1071);
model.SVM=svm(as.factor(accepted)~., data=trainSet, kernel = "radial", type='C-classification');
pred.SVM=predict(model.SVM, testSet, type='class');
confusionM.SVM=table(pred.SVM, credit.NoNA[1:200, ncol(credit.NoNA)]);
confusionM.SVM;
##
## pred.SVM 0 1
## 0 142 43
## 1 3 12
prec= confusionM.SVM[1,1]/(confusionM.SVM[1,1]+confusionM.SVM[1,2]);
reca= confusionM.SVM[1,1]/(confusionM.SVM[1,1]+confusionM.SVM[2,1]);
cat('Precision, Recall and F-score of SVM: ', prec, reca, 2*prec*reca/(prec+reca));
## Precision, Recall and F-score of SVM: 0.7675676 0.9793103 0.8606061
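The radial kernel's cost and gamma were left at their defaults above; a grid search with e1071's tune() could be used to check whether other settings perform better (a sketch, not run here):
# 10-fold cross-validated grid search over cost and gamma on the training set
set.seed(1);
tuned=tune(svm, as.factor(accepted)~., data=trainSet, ranges=list(cost=c(0.1, 1, 10), gamma=c(0.01, 0.05, 0.1)));
tuned$best.parameters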
Optimization
Random forest demonstrates the best capability of balancing precision and recall, so we choose the RF model to optimize. One common strategy for improving model performance is to remove anomalies (outliers) from the data before training.
The following gives the process of RF optimization after removing anomalies.
row.Outlier=NULL;
for (i in 1 : ncol(data.INT)){
new.Rows=which(is.element(data.INT[,i], boxplot.stats(data.INT[,i])$out)==T);
row.Outlier=union(row.Outlier, new.Rows)
};
credit.Outlier.Removed=credit[-row.Outlier,];
credit.OutlierNA.Removed=na.omit(credit.Outlier.Removed);
A new dataset has been generated after outlier removal.
library(randomForest);
Index=which(lapply(credit.OutlierNA.Removed[,],class)=='character');
for ( i in 1 :length(Index)) {
credit.OutlierNA.Removed[,Index[i]]=as.factor(credit.OutlierNA.Removed[,Index[i]])
};
trainSet.New=credit.OutlierNA.Removed[-(1:140), ];
testSet.New=credit.OutlierNA.Removed[1:140, -ncol(credit.OutlierNA.Removed)];
model.RF.New=randomForest(as.factor(accepted)~., data=trainSet.New, ntree=500, mtry=floor(sqrt(ncol(trainSet.New))));
pred.RF.New=predict(model.RF.New, testSet.New, type='class');
confusionM.RF.New=table(pred.RF.New, credit.OutlierNA.Removed[1:140, ncol(credit.OutlierNA.Removed)]);
confusionM.RF.New;
##
## pred.RF.New 0 1
## 0 100 24
## 1 4 12
prec=confusionM.RF.New[1,1]/(confusionM.RF.New[1,1]+confusionM.RF.New[1,2]);
reca=confusionM.RF.New[1,1]/(confusionM.RF.New[1,1]+confusionM.RF.New[2,1]);
cat('Precision, Recall and F-score of RF optimized: ', prec, reca, 2*prec*reca/(prec+reca));
## Precision, Recall and F-score of RF optimized: 0.8064516 0.9615385 0.877193
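Beyond removing outliers, the forest's own mtry parameter could be tuned against the out-of-bag error, for example with randomForest::tuneRF (a sketch under the same train/test split, not run above):
# Search for the mtry with the lowest out-of-bag error, starting from the
# default and stepping by stepFactor until the improvement drops below 1%
set.seed(1);
tuneRF(x=trainSet.New[, -ncol(trainSet.New)], y=as.factor(trainSet.New$accepted), ntreeTry=500, stepFactor=1.5, improve=0.01)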
If we plot ROC curves for all five models (optimized RF in red, original RF in blue, SVM in green, DT in yellow, KNN in purple), the optimized random forest should be the best prediction model under the current setting:
library(ROCR);
plot(performance(prediction(as.integer(pred.RF.New), credit.OutlierNA.Removed[1:140, ncol(credit.OutlierNA.Removed)]),'tpr','fpr'),col='red');
par(new=TRUE);
plot(performance(prediction(as.integer(pred.RF), credit.NoNA[1:200, ncol(credit.NoNA)]),'tpr','fpr'),col='blue');
par(new=TRUE);
plot(performance(prediction(as.integer(pred.SVM), credit.NoNA[1:200, ncol(credit.NoNA)]),'tpr','fpr'),col='green');
par(new=TRUE);
plot(performance(prediction(as.integer(pred.DT), credit.NoNA[1:200, ncol(credit.NoNA)]),'tpr','fpr'),col='yellow');
par(new=TRUE);
plot(performance(prediction(as.integer(pred.KNN), data.INT[1:200,ncol(data.INT)]),'tpr','fpr'),col='purple');
Finally, it is concluded that the overall performance ranking is:
RF.New > RF, DT > SVM > KNN