Can the status of U.S. residential mortgage borrowers be predicted at each observation time?
The mortgage data set is in panel form and reports origination and performance observations for 50,000 U.S. residential mortgage borrowers over 60 periods. The periods have been de-identified. As in the real world, loans may originate before the start of the observation period. The data set is a randomized selection of mortgage-loan-level data collected from the portfolios underlying U.S. residential mortgage-backed securities (RMBS) securitizations.
| id | time | orig_time | first_time | mat_time | balance_time | LTV_time | interest_rate_time | hpi_time | gdp_time | uer_time | REtype_CO_orig_time | REtype_PU_orig_time | REtype_SF_orig_time | investor_orig_time | balance_orig_time | FICO_orig_time | LTV_orig_time | Interest_Rate_orig_time | hpi_orig_time | default_time | payoff_time | status_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 25 | -7 | 25 | 113 | 41303.42 | 24.49834 | 9.2 | 226.29 | 2.8991367 | 4.7 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
| 1 | 26 | -7 | 25 | 113 | 41061.95 | 24.48387 | 9.2 | 225.10 | 2.1513649 | 4.7 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
| 1 | 27 | -7 | 25 | 113 | 40804.42 | 24.62680 | 9.2 | 222.39 | 2.3617217 | 4.4 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
| 1 | 28 | -7 | 25 | 113 | 40483.89 | 24.73588 | 9.2 | 219.67 | 1.2291722 | 4.6 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
| 1 | 29 | -7 | 25 | 113 | 40367.06 | 24.92548 | 9.2 | 217.37 | 1.6929687 | 4.5 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
| 1 | 30 | -7 | 25 | 113 | 40127.97 | 25.31829 | 9.2 | 212.73 | 2.2742178 | 4.7 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
| 1 | 31 | -7 | 25 | 113 | 39718.66 | 26.56612 | 9.2 | 200.67 | 1.8506892 | 4.7 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
| 1 | 32 | -7 | 25 | 113 | 35877.03 | 25.87256 | 9.2 | 186.12 | 1.1041628 | 5.0 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
| 1 | 33 | -7 | 25 | 113 | 34410.03 | 25.58443 | 9.2 | 180.52 | 0.8368587 | 5.0 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
| 1 | 34 | -7 | 25 | 113 | 33590.47 | 26.00807 | 9.2 | 173.35 | -0.3144477 | 5.8 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
| 1 | 35 | -7 | 25 | 113 | 32952.48 | 27.28650 | 9.2 | 162.09 | -2.8058441 | 6.5 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
| 1 | 36 | -7 | 25 | 113 | 32688.30 | 28.96363 | 9.2 | 151.48 | -3.5165680 | 7.8 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
| 1 | 37 | -7 | 25 | 113 | 32388.30 | 28.34786 | 9.2 | 153.35 | -4.1467109 | 9.0 | 0 | 0 | 1 | 0 | 45000 | 715 | 69.4 | 9.2 | 87.03 | 0 | 0 | 0 |
- id: Borrower ID
- time: Time stamp of observation
- orig_time: Time stamp of origination
- first_time: Time stamp of first observation
- mat_time: Time stamp of maturity
- balance_time: Outstanding balance at observation time
- LTV_time: Loan-to-value ratio at observation time, in %
- interest_rate_time: Interest rate at observation time, in %
- hpi_time: House price index at observation time, base year = 100
- gdp_time: Gross domestic product (GDP) growth at observation time, in %
- uer_time: Unemployment rate at observation time, in %
- REtype_CO_orig_time: Real estate type condominium = 1, otherwise = 0
- REtype_PU_orig_time: Real estate type planned urban development = 1, otherwise = 0
- REtype_SF_orig_time: Single-family home = 1, otherwise = 0
- investor_orig_time: Investor borrower = 1, otherwise = 0
- balance_orig_time: Outstanding balance at origination time
- FICO_orig_time: FICO score at origination time
- LTV_orig_time: Loan-to-value ratio at origination time, in %
- Interest_Rate_orig_time: Interest rate at origination time, in %
- hpi_orig_time: House price index at origination time, base year = 100
- default_time: Default observation at observation time
- payoff_time: Payoff observation at observation time
- status_time: Default (1), payoff (2), or nondefault/nonpayoff (0) observation at observation time
status_time is the outcome variable and the remaining variables act as explanatory variables; the variable id is omitted because it is only a borrower identifier and carries no information for model building.
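The steps below assume the data have already been read into R as a tibble named mortgage1; a minimal sketch of that step, assuming the data are available locally as a CSV file named mortgage.csv (the file name and location are hypothetical):
library(readr)
# Read the raw panel data and keep a working copy named mortgage1
mortgage <- read_csv("mortgage.csv")
mortgage1 <- mortgage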
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
mortgage1<-select(mortgage1,-c("id"))
sum(is.na(mortgage1))
[1] 270
summary(mortgage1)
time orig_time first_time mat_time
Min. : 1.0 Min. :-40.00 Min. : 1.00 Min. : 18.0
1st Qu.:27.0 1st Qu.: 18.00 1st Qu.:21.00 1st Qu.:137.0
Median :34.0 Median : 22.00 Median :25.00 Median :142.0
Mean :35.8 Mean : 20.57 Mean :24.61 Mean :137.2
3rd Qu.:44.0 3rd Qu.: 25.00 3rd Qu.:28.00 3rd Qu.:145.0
Max. :60.0 Max. : 60.00 Max. :60.00 Max. :229.0
balance_time LTV_time interest_rate_time hpi_time
Min. : 0 Min. : 0.00 Min. : 0.000 Min. :107.8
1st Qu.: 102017 1st Qu.: 67.11 1st Qu.: 5.650 1st Qu.:158.6
Median : 180618 Median : 82.25 Median : 6.625 Median :180.5
Mean : 245965 Mean : 83.08 Mean : 6.702 Mean :184.1
3rd Qu.: 337495 3rd Qu.:100.63 3rd Qu.: 7.875 3rd Qu.:212.7
Max. :8701859 Max. :803.51 Max. :37.500 Max. :226.3
NA's :270
gdp_time uer_time REtype_CO_orig_time REtype_PU_orig_time
Min. :-4.147 Min. : 3.800 Min. :0.0000 Min. :0.0000
1st Qu.: 1.104 1st Qu.: 4.700 1st Qu.:0.0000 1st Qu.:0.0000
Median : 1.851 Median : 5.700 Median :0.0000 Median :0.0000
Mean : 1.381 Mean : 6.517 Mean :0.0676 Mean :0.1248
3rd Qu.: 2.694 3rd Qu.: 8.200 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. : 5.132 Max. :10.000 Max. :1.0000 Max. :1.0000
REtype_SF_orig_time investor_orig_time balance_orig_time FICO_orig_time
Min. :0.0000 Min. :0.0000 Min. : 0 Min. :400.0
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 108000 1st Qu.:626.0
Median :1.0000 Median :0.0000 Median : 188000 Median :678.0
Mean :0.6121 Mean :0.1382 Mean : 256254 Mean :673.6
3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.: 352000 3rd Qu.:729.0
Max. :1.0000 Max. :1.0000 Max. :8000000 Max. :840.0
LTV_orig_time Interest_Rate_orig_time hpi_orig_time
Min. : 50.10 Min. : 0.000 Min. : 75.71
1st Qu.: 75.00 1st Qu.: 5.000 1st Qu.:179.45
Median : 80.00 Median : 6.290 Median :216.77
Mean : 78.98 Mean : 5.650 Mean :198.12
3rd Qu.: 80.00 3rd Qu.: 7.456 3rd Qu.:222.39
Max. :218.50 Max. :19.750 Max. :226.29
default_time payoff_time status_time
Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
Median :0.00000 Median :0.00000 Median :0.0000
Mean :0.02435 Mean :0.04271 Mean :0.1098
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
Max. :1.00000 Max. :1.00000 Max. :2.0000
str(mortgage1)
Classes 'tbl_df', 'tbl' and 'data.frame': 622489 obs. of 22 variables:
$ time : num 25 26 27 28 29 30 31 32 33 34 ...
$ orig_time : num -7 -7 -7 -7 -7 -7 -7 -7 -7 -7 ...
$ first_time : num 25 25 25 25 25 25 25 25 25 25 ...
$ mat_time : num 113 113 113 113 113 113 113 113 113 113 ...
$ balance_time : num 41303 41062 40804 40484 40367 ...
$ LTV_time : num 24.5 24.5 24.6 24.7 24.9 ...
$ interest_rate_time : num 9.2 9.2 9.2 9.2 9.2 9.2 9.2 9.2 9.2 9.2 ...
$ hpi_time : num 226 225 222 220 217 ...
$ gdp_time : num 2.9 2.15 2.36 1.23 1.69 ...
$ uer_time : num 4.7 4.7 4.4 4.6 4.5 4.7 4.7 5 5 5.8 ...
$ REtype_CO_orig_time : num 0 0 0 0 0 0 0 0 0 0 ...
$ REtype_PU_orig_time : num 0 0 0 0 0 0 0 0 0 0 ...
$ REtype_SF_orig_time : num 1 1 1 1 1 1 1 1 1 1 ...
$ investor_orig_time : num 0 0 0 0 0 0 0 0 0 0 ...
$ balance_orig_time : num 45000 45000 45000 45000 45000 45000 45000 45000 45000 45000 ...
$ FICO_orig_time : num 715 715 715 715 715 715 715 715 715 715 ...
$ LTV_orig_time : num 69.4 69.4 69.4 69.4 69.4 69.4 69.4 69.4 69.4 69.4 ...
$ Interest_Rate_orig_time: num 9.2 9.2 9.2 9.2 9.2 9.2 9.2 9.2 9.2 9.2 ...
$ hpi_orig_time : num 87 87 87 87 87 ...
$ default_time : num 0 0 0 0 0 0 0 0 0 0 ...
$ payoff_time : num 0 0 0 0 0 0 0 0 0 0 ...
$ status_time : num 0 0 0 0 0 0 0 0 0 0 ...
mortgage1<-na.omit(mortgage1)
View(mortgage1)
The 270 missing values all occur in the variable LTV_time. They could be replaced by the mean or median, or simply dropped; here the rows with missing data are removed.
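Had imputation been preferred over deletion, the missing LTV_time values could instead be filled with, for example, the median of the observed values; a minimal sketch of that alternative (not applied here):
# Alternative to na.omit(): replace missing LTV_time with the observed median
mortgage1$LTV_time[is.na(mortgage1$LTV_time)] <- median(mortgage1$LTV_time, na.rm = TRUE)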
boxplot(mortgage1)
The boxplot shows that the data contain outliers; a few extreme outliers lie far from the rest and are removed, as they may affect the performance of the model.
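The variable-specific cut-offs applied below were presumably chosen by inspecting the most extreme values of each variable; a sketch of one way to do that inspection, using balance_orig_time as an example:
# Values flagged as outliers by the boxplot (1.5 * IQR) rule for one variable
out_bal <- boxplot.stats(mortgage1$balance_orig_time)$out
summary(out_bal)
# How many rows the 6,000,000 cut-off used below would remove
sum(mortgage1$balance_orig_time > 6e+06)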
mortgage1<-mortgage1[!mortgage1$balance_orig_time>6e+06,]
mortgage1<-mortgage1[!mortgage1$LTV_time>400,]
mortgage1<-mortgage1[!mortgage1$interest_rate_time>20,]
gdpout<-boxplot(mortgage1$gdp_time)$out
mortgage1<-mortgage1[! mortgage1$gdp_time %in% c(gdpout),]
mortgage1<-mortgage1[!mortgage1$LTV_orig_time>120,]
mortgage1<-mortgage1[!mortgage1$Interest_Rate_orig_time>15,]
mortgage1<-mortgage1[!mortgage1$Interest_Rate_orig_time==0,]
mortgage1<-mortgage1[!mortgage1$balance_orig_time>2300000,]
dim(mortgage1)
[1] 468317 22
boxplot(mortgage1)
After removing outliers, the mortgage1 data set is reduced to 468,317 observations of 22 variables.
Next, split mortgage1 into one training set and five test sets (the data are large enough to support several hold-out sets).
library(caret)
Loading required package: lattice
Loading required package: ggplot2
splitdata<-createDataPartition(mortgage1$status_time,p=0.30,list = FALSE)
train<-mortgage1[splitdata,]
data1<-mortgage1[-splitdata,]
splitdata1<-createDataPartition(data1$status_time,p=0.20,list = FALSE)
test1<-data1[splitdata1,]
data2<-data1[-splitdata1,]
splitdata2<-createDataPartition(data2$status_time,p=0.25,list = FALSE)
test2<-data2[splitdata2,]
data3<-data2[-splitdata2,]
splitdata3<-createDataPartition(data3$status_time,p=0.30,list = FALSE)
test3<-data3[splitdata3,]
data4<-data3[-splitdata3,]
splitdata4<-createDataPartition(data4$status_time,p=0.40,list = FALSE)
test4<-data4[splitdata4,]
test5<-data4[-splitdata4,]
dim(train)
[1] 140496 22
dim(test1)
[1] 65565 22
dim(test2)
[1] 65564 22
dim(test3)
[1] 59008 22
dim(test4)
[1] 55074 22
dim(test5)
[1] 82610 22
The data set is split into a training set with 140,496 observations and test sets test1 through test5 with 65,565, 65,564, 59,008, 55,074, and 82,610 observations respectively.
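Because createDataPartition() samples within groups of the outcome, the class proportions of status_time should be roughly preserved across the splits; a quick check (sketch):
# Compare the class distribution of status_time across splits
round(prop.table(table(train$status_time)), 3)
round(prop.table(table(test1$status_time)), 3)
round(prop.table(table(test5$status_time)), 3)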
Now fit several candidate models on the training data and compare their performance to select the best one.
library(nnet)
model_log<-multinom(status_time ~ ., data = train)
# weights: 69 (44 variable)
initial value 154350.632109
iter 10 value 73758.982173
iter 20 value 67036.810340
iter 30 value 24552.344610
iter 40 value 13976.641106
iter 50 value 131.795275
iter 60 value 18.496431
iter 70 value 5.625945
iter 80 value 3.239375
iter 90 value 2.125496
iter 100 value 0.900931
final value 0.900931
stopped after 100 iterations
model_log
Call:
multinom(formula = status_time ~ ., data = train)
Coefficients:
(Intercept) time orig_time first_time mat_time balance_time
1 -7.182890 -0.016759495 0.045898631 -0.07625945 0.04611751 2.890578e-05
2 -7.995852 -0.002865867 0.006291137 0.01074590 0.01400345 1.240231e-06
LTV_time interest_rate_time hpi_time gdp_time uer_time
1 0.01432979 -0.33483585 -0.04852568 0.164245195 -0.09971282
2 -0.01741873 -0.00493687 -0.01018085 -0.001638399 -0.13768944
REtype_CO_orig_time REtype_PU_orig_time REtype_SF_orig_time
1 -1.1961216 -0.1472076 -0.08401662
2 0.3241847 0.4445232 0.05584970
investor_orig_time balance_orig_time FICO_orig_time LTV_orig_time
1 -0.774734844 -4.001732e-05 0.007138301 -0.037759716
2 -0.003901038 -2.857548e-06 -0.001320203 -0.009110824
Interest_Rate_orig_time hpi_orig_time default_time payoff_time
1 -0.21944120 -0.0122722276 47.69307 9.455724
2 -0.02074202 -0.0003841193 14.15963 23.397839
Residual Deviance: 1.801862
AIC: 89.80186
The multinomial logistic regression stopped after 100 iterations at the reported minimum of the objective; the intercepts and slopes for the two non-reference classes (1 = default, 2 = payoff) are shown above.
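multinom() does not print p-values; if the significance of individual slopes is of interest, approximate Wald z-statistics can be computed from the coefficient and standard-error matrices returned by summary() (a standard sketch, not run above):
# Approximate Wald tests for the multinomial coefficients
sum_log <- summary(model_log)                        # may take a while (computes the Hessian)
z <- sum_log$coefficients / sum_log$standard.errors
p <- 2 * (1 - pnorm(abs(z)))
round(p, 4)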
pred_train<-predict(model_log,train,type="class")
confusionMatrix(as.factor(train$status_time),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1 2
0 132106 0 0
1 0 2737 0
2 0 0 5653
Overall Statistics
Accuracy : 1
95% CI : (1, 1)
No Information Rate : 0.9403
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2
Sensitivity 1.0000 1.00000 1.00000
Specificity 1.0000 1.00000 1.00000
Pos Pred Value 1.0000 1.00000 1.00000
Neg Pred Value 1.0000 1.00000 1.00000
Prevalence 0.9403 0.01948 0.04024
Detection Rate 0.9403 0.01948 0.04024
Detection Prevalence 0.9403 0.01948 0.04024
Balanced Accuracy 1.0000 1.00000 1.00000
library(gmodels)
CrossTable(pred_train, train$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 140496
| actual values
predicted values | 0 | 1 | 2 | Row Total |
-----------------|-----------|-----------|-----------|-----------|
0 | 132106 | 0 | 0 | 132106 |
| 0.940 | 0.000 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
1 | 0 | 2737 | 0 | 2737 |
| 0.000 | 0.019 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
2 | 0 | 0 | 5653 | 5653 |
| 0.000 | 0.000 | 0.040 | |
-----------------|-----------|-----------|-----------|-----------|
Column Total | 132106 | 2737 | 5653 | 140496 |
-----------------|-----------|-----------|-----------|-----------|
The model achieves perfect accuracy on the training data: every observation is predicted correctly, as shown in the cross table.
library(rpart)
library(rpart.plot)
model_cart<-rpart(status_time ~ ., data=train)
model_cart$cptable
CP nsplit rel error xerror xstd
1 0.8880182 0 1.0000000 1.0000096 0.011159628
2 0.1119818 1 0.1119818 0.1119827 0.002075663
3 0.0100000 2 0.0000000 0.0000000 0.000000000
model_cart$variable.importance
payoff_time default_time LTV_time balance_time
21263.9144 2681.4453 210.6455 206.8840
rpart.plot(model_cart)
The complexity-parameter table shows that the first split reduces the relative error by about 0.888 and the second split by a further 0.112; with two splits the tree reaches zero error, and growing stops at the default complexity parameter of 0.01.
The rpart plot shows payoff_time as the root node, with binary splits below it.
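Here the fully grown tree already has zero cross-validated error, so pruning changes nothing; for completeness, the usual rpart pruning step would select the complexity parameter with the smallest xerror from the cptable (sketch):
# Prune at the cp value with the lowest cross-validated error
best_cp <- model_cart$cptable[which.min(model_cart$cptable[, "xerror"]), "CP"]
model_cart_pruned <- prune(model_cart, cp = best_cp)
rpart.plot(model_cart_pruned)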
pred_train<-predict(model_cart,train)
confusionMatrix(as.factor(train$status_time),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1 2
0 132106 0 0
1 0 2737 0
2 0 0 5653
Overall Statistics
Accuracy : 1
95% CI : (1, 1)
No Information Rate : 0.9403
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2
Sensitivity 1.0000 1.00000 1.00000
Specificity 1.0000 1.00000 1.00000
Pos Pred Value 1.0000 1.00000 1.00000
Neg Pred Value 1.0000 1.00000 1.00000
Prevalence 0.9403 0.01948 0.04024
Detection Rate 0.9403 0.01948 0.04024
Detection Prevalence 0.9403 0.01948 0.04024
Balanced Accuracy 1.0000 1.00000 1.00000
library(gmodels)
CrossTable(pred_train, train$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 140496
| actual values
predicted values | 0 | 1 | 2 | Row Total |
-----------------|-----------|-----------|-----------|-----------|
0 | 132106 | 0 | 0 | 132106 |
| 0.940 | 0.000 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
1 | 0 | 2737 | 0 | 2737 |
| 0.000 | 0.019 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
2 | 0 | 0 | 5653 | 5653 |
| 0.000 | 0.000 | 0.040 | |
-----------------|-----------|-----------|-----------|-----------|
Column Total | 132106 | 2737 | 5653 | 140496 |
-----------------|-----------|-----------|-----------|-----------|
The decision tree also achieves perfect accuracy, predicting every training observation correctly.
library(xgboost)
Attaching package: 'xgboost'
The following object is masked from 'package:dplyr':
slice
x<-train[,1:21]
y<-train[,22]
model_xgb<-xgboost(data=as.matrix(x),label = as.matrix(y),nrounds = 100)
[1] train-rmse:0.402417
[2] train-rmse:0.281699
[3] train-rmse:0.197194
[4] train-rmse:0.138040
[5] train-rmse:0.096630
[6] train-rmse:0.067643
[7] train-rmse:0.047351
[8] train-rmse:0.033147
[9] train-rmse:0.023203
[10] train-rmse:0.016243
[11] train-rmse:0.011370
[12] train-rmse:0.007959
[13] train-rmse:0.005572
[14] train-rmse:0.003900
[15] train-rmse:0.002730
[16] train-rmse:0.001911
[17] train-rmse:0.001338
[18] train-rmse:0.000937
[19] train-rmse:0.000656
[20] train-rmse:0.000459
[21] train-rmse:0.000321
[22] train-rmse:0.000225
[23] train-rmse:0.000157
[24] train-rmse:0.000110
[25] train-rmse:0.000077
[26] train-rmse:0.000054
[27] train-rmse:0.000038
[28] train-rmse:0.000026
[29] train-rmse:0.000019
[30] train-rmse:0.000013
[31] train-rmse:0.000009
[32] train-rmse:0.000007
[33] train-rmse:0.000005
[34] train-rmse:0.000004
[35] train-rmse:0.000003
[36] train-rmse:0.000003
[37] train-rmse:0.000003
[38] train-rmse:0.000003
[39] train-rmse:0.000003
[40] train-rmse:0.000003
[41] train-rmse:0.000003
[42] train-rmse:0.000003
[43] train-rmse:0.000003
[44] train-rmse:0.000003
[45] train-rmse:0.000003
[46] train-rmse:0.000003
[47] train-rmse:0.000003
[48] train-rmse:0.000003
[49] train-rmse:0.000003
[50] train-rmse:0.000003
[51] train-rmse:0.000003
[52] train-rmse:0.000003
[53] train-rmse:0.000003
[54] train-rmse:0.000003
[55] train-rmse:0.000003
[56] train-rmse:0.000003
[57] train-rmse:0.000003
[58] train-rmse:0.000003
[59] train-rmse:0.000003
[60] train-rmse:0.000003
[61] train-rmse:0.000003
[62] train-rmse:0.000003
[63] train-rmse:0.000003
[64] train-rmse:0.000003
[65] train-rmse:0.000003
[66] train-rmse:0.000003
[67] train-rmse:0.000003
[68] train-rmse:0.000003
[69] train-rmse:0.000003
[70] train-rmse:0.000003
[71] train-rmse:0.000003
[72] train-rmse:0.000003
[73] train-rmse:0.000003
[74] train-rmse:0.000003
[75] train-rmse:0.000003
[76] train-rmse:0.000003
[77] train-rmse:0.000003
[78] train-rmse:0.000003
[79] train-rmse:0.000003
[80] train-rmse:0.000003
[81] train-rmse:0.000003
[82] train-rmse:0.000003
[83] train-rmse:0.000003
[84] train-rmse:0.000003
[85] train-rmse:0.000003
[86] train-rmse:0.000003
[87] train-rmse:0.000003
[88] train-rmse:0.000003
[89] train-rmse:0.000003
[90] train-rmse:0.000003
[91] train-rmse:0.000003
[92] train-rmse:0.000003
[93] train-rmse:0.000003
[94] train-rmse:0.000003
[95] train-rmse:0.000003
[96] train-rmse:0.000003
[97] train-rmse:0.000003
[98] train-rmse:0.000003
[99] train-rmse:0.000003
[100] train-rmse:0.000003
Here xgboost performs 100 boosting rounds; at each round a new tree is fitted to the residuals of the current ensemble, and the training root-mean-square error (RMSE) is reported. Because the default objective is squared-error regression, status_time is treated as a numeric target and the predictions are rounded to the nearest class below.
pred<-predict(model_xgb,as.matrix(x))
pred_train<-round(pred)
confusionMatrix(as.factor(train$status_time),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1 2
0 132106 0 0
1 0 2737 0
2 0 0 5653
Overall Statistics
Accuracy : 1
95% CI : (1, 1)
No Information Rate : 0.9403
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2
Sensitivity 1.0000 1.00000 1.00000
Specificity 1.0000 1.00000 1.00000
Pos Pred Value 1.0000 1.00000 1.00000
Neg Pred Value 1.0000 1.00000 1.00000
Prevalence 0.9403 0.01948 0.04024
Detection Rate 0.9403 0.01948 0.04024
Detection Prevalence 0.9403 0.01948 0.04024
Balanced Accuracy 1.0000 1.00000 1.00000
library(gmodels)
CrossTable(pred_train, train$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 140496
| actual values
predicted values | 0 | 1 | 2 | Row Total |
-----------------|-----------|-----------|-----------|-----------|
0 | 132106 | 0 | 0 | 132106 |
| 0.940 | 0.000 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
1 | 0 | 2737 | 0 | 2737 |
| 0.000 | 0.019 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
2 | 0 | 0 | 5653 | 5653 |
| 0.000 | 0.000 | 0.040 | |
-----------------|-----------|-----------|-----------|-----------|
Column Total | 132106 | 2737 | 5653 | 140496 |
-----------------|-----------|-----------|-----------|-----------|
The extreme gradient boosting (XGBoost) model also fits the training data perfectly.
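Since status_time is a three-class label rather than a continuous quantity, an arguably cleaner setup than squared-error regression followed by rounding is xgboost's multi-class objective; a hedged sketch (the parameter choices are illustrative, not those used above):
# Alternative: treat status_time explicitly as a 3-class target
model_xgb_mc <- xgboost(data = as.matrix(x), label = as.matrix(y), nrounds = 100,
                        objective = "multi:softmax", num_class = 3, verbose = 0)
pred_mc <- predict(model_xgb_mc, as.matrix(x))  # returns class labels 0/1/2 directly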
All three models (multinomial logistic regression, the CART decision tree, and extreme gradient boosting) performed equally well on the training data, so any of them could be carried forward; here the multinomial logistic regression is used to predict the outcome on the different test sets.
pred_test1<-predict(model_log,test1,type="class")
confusionMatrix(as.factor(test1$status_time),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1 2
0 61554 0 0
1 0 1325 0
2 0 0 2686
Overall Statistics
Accuracy : 1
95% CI : (0.9999, 1)
No Information Rate : 0.9388
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2
Sensitivity 1.0000 1.00000 1.00000
Specificity 1.0000 1.00000 1.00000
Pos Pred Value 1.0000 1.00000 1.00000
Neg Pred Value 1.0000 1.00000 1.00000
Prevalence 0.9388 0.02021 0.04097
Detection Rate 0.9388 0.02021 0.04097
Detection Prevalence 0.9388 0.02021 0.04097
Balanced Accuracy 1.0000 1.00000 1.00000
library(gmodels)
CrossTable(pred_test1, test1$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 65565
| actual values
predicted values | 0 | 1 | 2 | Row Total |
-----------------|-----------|-----------|-----------|-----------|
0 | 61554 | 0 | 0 | 61554 |
| 0.939 | 0.000 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
1 | 0 | 1325 | 0 | 1325 |
| 0.000 | 0.020 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
2 | 0 | 0 | 2686 | 2686 |
| 0.000 | 0.000 | 0.041 | |
-----------------|-----------|-----------|-----------|-----------|
Column Total | 61554 | 1325 | 2686 | 65565 |
-----------------|-----------|-----------|-----------|-----------|
pred_test2<-predict(model_log,test2,type="class")
confusionMatrix(as.factor(test2$status_time),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1 2
0 61547 0 0
1 0 1372 0
2 0 0 2645
Overall Statistics
Accuracy : 1
95% CI : (0.9999, 1)
No Information Rate : 0.9387
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2
Sensitivity 1.0000 1.00000 1.00000
Specificity 1.0000 1.00000 1.00000
Pos Pred Value 1.0000 1.00000 1.00000
Neg Pred Value 1.0000 1.00000 1.00000
Prevalence 0.9387 0.02093 0.04034
Detection Rate 0.9387 0.02093 0.04034
Detection Prevalence 0.9387 0.02093 0.04034
Balanced Accuracy 1.0000 1.00000 1.00000
library(gmodels)
CrossTable(pred_test2, test2$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 65564
| actual values
predicted values | 0 | 1 | 2 | Row Total |
-----------------|-----------|-----------|-----------|-----------|
0 | 61547 | 0 | 0 | 61547 |
| 0.939 | 0.000 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
1 | 0 | 1372 | 0 | 1372 |
| 0.000 | 0.021 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
2 | 0 | 0 | 2645 | 2645 |
| 0.000 | 0.000 | 0.040 | |
-----------------|-----------|-----------|-----------|-----------|
Column Total | 61547 | 1372 | 2645 | 65564 |
-----------------|-----------|-----------|-----------|-----------|
pred_test3<-predict(model_log,test3,type="class")
confusionMatrix(as.factor(test3$status_time),as.factor(pred_test3))
Confusion Matrix and Statistics
Reference
Prediction 0 1 2
0 55379 0 0
1 0 1241 0
2 0 0 2388
Overall Statistics
Accuracy : 1
95% CI : (0.9999, 1)
No Information Rate : 0.9385
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2
Sensitivity 1.0000 1.00000 1.00000
Specificity 1.0000 1.00000 1.00000
Pos Pred Value 1.0000 1.00000 1.00000
Neg Pred Value 1.0000 1.00000 1.00000
Prevalence 0.9385 0.02103 0.04047
Detection Rate 0.9385 0.02103 0.04047
Detection Prevalence 0.9385 0.02103 0.04047
Balanced Accuracy 1.0000 1.00000 1.00000
library(gmodels)
CrossTable(pred_test3, test3$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 59008
| actual values
predicted values | 0 | 1 | 2 | Row Total |
-----------------|-----------|-----------|-----------|-----------|
0 | 55379 | 0 | 0 | 55379 |
| 0.938 | 0.000 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
1 | 0 | 1241 | 0 | 1241 |
| 0.000 | 0.021 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
2 | 0 | 0 | 2388 | 2388 |
| 0.000 | 0.000 | 0.040 | |
-----------------|-----------|-----------|-----------|-----------|
Column Total | 55379 | 1241 | 2388 | 59008 |
-----------------|-----------|-----------|-----------|-----------|
pred_test4<-predict(model_log,test4,type="class")
confusionMatrix(as.factor(test4$status_time),as.factor(pred_test4))
Confusion Matrix and Statistics
Reference
Prediction 0 1 2
0 51639 0 0
1 0 1207 0
2 0 0 2228
Overall Statistics
Accuracy : 1
95% CI : (0.9999, 1)
No Information Rate : 0.9376
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2
Sensitivity 1.0000 1.00000 1.00000
Specificity 1.0000 1.00000 1.00000
Pos Pred Value 1.0000 1.00000 1.00000
Neg Pred Value 1.0000 1.00000 1.00000
Prevalence 0.9376 0.02192 0.04045
Detection Rate 0.9376 0.02192 0.04045
Detection Prevalence 0.9376 0.02192 0.04045
Balanced Accuracy 1.0000 1.00000 1.00000
library(gmodels)
CrossTable(pred_test4, test4$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 55074
| actual values
predicted values | 0 | 1 | 2 | Row Total |
-----------------|-----------|-----------|-----------|-----------|
0 | 51639 | 0 | 0 | 51639 |
| 0.938 | 0.000 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
1 | 0 | 1207 | 0 | 1207 |
| 0.000 | 0.022 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
2 | 0 | 0 | 2228 | 2228 |
| 0.000 | 0.000 | 0.040 | |
-----------------|-----------|-----------|-----------|-----------|
Column Total | 51639 | 1207 | 2228 | 55074 |
-----------------|-----------|-----------|-----------|-----------|
pred_test5<-predict(model_log,test5,type="class")
confusionMatrix(as.factor(test5$status_time),as.factor(pred_test5))
Confusion Matrix and Statistics
Reference
Prediction 0 1 2
0 77499 0 0
1 0 1724 0
2 0 0 3387
Overall Statistics
Accuracy : 1
95% CI : (1, 1)
No Information Rate : 0.9381
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2
Sensitivity 1.0000 1.00000 1.000
Specificity 1.0000 1.00000 1.000
Pos Pred Value 1.0000 1.00000 1.000
Neg Pred Value 1.0000 1.00000 1.000
Prevalence 0.9381 0.02087 0.041
Detection Rate 0.9381 0.02087 0.041
Detection Prevalence 0.9381 0.02087 0.041
Balanced Accuracy 1.0000 1.00000 1.000
library(gmodels)
CrossTable(pred_test5, test5$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 82610
| actual values
predicted values | 0 | 1 | 2 | Row Total |
-----------------|-----------|-----------|-----------|-----------|
0 | 77499 | 0 | 0 | 77499 |
| 0.938 | 0.000 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
1 | 0 | 1724 | 0 | 1724 |
| 0.000 | 0.021 | 0.000 | |
-----------------|-----------|-----------|-----------|-----------|
2 | 0 | 0 | 3387 | 3387 |
| 0.000 | 0.000 | 0.041 | |
-----------------|-----------|-----------|-----------|-----------|
Column Total | 77499 | 1724 | 3387 | 82610 |
-----------------|-----------|-----------|-----------|-----------|
The multinomial logistic regression model also performed perfectly on every combination of test data.
All of the reported metrics (accuracy, kappa, sensitivity, specificity, and balanced accuracy) are at 100% for the multinomial logistic regression on the training data and on every test set. Results this perfect should not be taken at face value: the model may simply be reproducing patterns specific to this data set and could fail on future data, i.e. it is prone to overfitting. The multinomial logistic regression model therefore needs to be tested on additional, independent data of the same kind before it can be relied upon.
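One plausible explanation for the perfect scores, worth checking before trusting the model, is that status_time is defined directly from default_time and payoff_time (see the data dictionary above), so those two predictors may simply leak the outcome; a sketch of how that could be verified and addressed (the refit line is a suggestion, not run here):
# Check whether status_time is fully determined by default_time and payoff_time
with(train, table(status_time, default_time, payoff_time))
# If so, refitting without these columns gives a more honest assessment, e.g.:
# model_log2 <- multinom(status_time ~ . - default_time - payoff_time, data = train)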