PROBLEM STATEMENT

Can the status of U.S. residential mortgage borrowers be predicted at their respective observation times?

DATASET INFORMATION

The mortgage dataset is in panel form and reports origination and performance observations for 50,000 U.S. residential mortgage borrowers over 60 periods. The periods have been de-identified. As in the real world, loans may originate before the start of the observation period. The dataset is a randomized selection of mortgage-loan-level data collected from the portfolios underlying U.S. residential mortgage-backed securitizations.

MORTGAGE DATASET (first 13 observations shown)
id time orig_time first_time mat_time balance_time LTV_time interest_rate_time hpi_time gdp_time uer_time REtype_CO_orig_time REtype_PU_orig_time REtype_SF_orig_time investor_orig_time balance_orig_time FICO_orig_time LTV_orig_time Interest_Rate_orig_time hpi_orig_time default_time payoff_time status_time
1 25 -7 25 113 41303.42 24.49834 9.2 226.29 2.8991367 4.7 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0
1 26 -7 25 113 41061.95 24.48387 9.2 225.10 2.1513649 4.7 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0
1 27 -7 25 113 40804.42 24.62680 9.2 222.39 2.3617217 4.4 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0
1 28 -7 25 113 40483.89 24.73588 9.2 219.67 1.2291722 4.6 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0
1 29 -7 25 113 40367.06 24.92548 9.2 217.37 1.6929687 4.5 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0
1 30 -7 25 113 40127.97 25.31829 9.2 212.73 2.2742178 4.7 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0
1 31 -7 25 113 39718.66 26.56612 9.2 200.67 1.8506892 4.7 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0
1 32 -7 25 113 35877.03 25.87256 9.2 186.12 1.1041628 5.0 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0
1 33 -7 25 113 34410.03 25.58443 9.2 180.52 0.8368587 5.0 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0
1 34 -7 25 113 33590.47 26.00807 9.2 173.35 -0.3144477 5.8 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0
1 35 -7 25 113 32952.48 27.28650 9.2 162.09 -2.8058441 6.5 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0
1 36 -7 25 113 32688.30 28.96363 9.2 151.48 -3.5165680 7.8 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0
1 37 -7 25 113 32388.30 28.34786 9.2 153.35 -4.1467109 9.0 0 0 1 0 45000 715 69.4 9.2 87.03 0 0 0

ATTRIBUTES DESCRIPTION

id: Borrower ID
time: Time stamp of observation
orig_time: Time stamp for origination
first_time: Time stamp for first observation
mat_time: Time stamp for maturity
balance_time: Outstanding balance at observation time
LTV_time: Loan-to-value ratio at observation time, in %
interest_rate_time: Interest rate at observation time, in %
hpi_time: House price index at observation time, base year = 100
gdp_time: Gross domestic product (GDP) growth at observation time, in %
uer_time: Unemployment rate at observation time, in %
REtype_CO_orig_time: Real estate type condominium = 1, otherwise = 0
REtype_PU_orig_time: Real estate type planned urban development = 1, otherwise = 0
REtype_SF_orig_time: Single-family home = 1, otherwise = 0
investor_orig_time: Investor borrower = 1, otherwise = 0
balance_orig_time: Outstanding balance at origination time
FICO_orig_time: FICO score at origination time, in %
LTV_orig_time: Loan-to-value ratio at origination time, in %
Interest_Rate_orig_time: Interest rate at origination time, in %
hpi_orig_time: House price index at origination time, base year = 100
default_time: Default observation at observation time
payoff_time: Payoff observation at observation time
status_time: Default (1), payoff (2), and nondefault/nonpayoff (0) observation at observation time

status_time is the outcome variable, and the remaining variables serve as explanatory variables. The variable 'id' is dropped because it is only the borrower's identifier and carries no information for model building.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
mortgage1<-select(mortgage1,-c("id"))

Check For Missing Values

sum(is.na(mortgage1))
[1] 270
summary(mortgage1)
      time        orig_time        first_time       mat_time    
 Min.   : 1.0   Min.   :-40.00   Min.   : 1.00   Min.   : 18.0  
 1st Qu.:27.0   1st Qu.: 18.00   1st Qu.:21.00   1st Qu.:137.0  
 Median :34.0   Median : 22.00   Median :25.00   Median :142.0  
 Mean   :35.8   Mean   : 20.57   Mean   :24.61   Mean   :137.2  
 3rd Qu.:44.0   3rd Qu.: 25.00   3rd Qu.:28.00   3rd Qu.:145.0  
 Max.   :60.0   Max.   : 60.00   Max.   :60.00   Max.   :229.0  
                                                                
  balance_time        LTV_time      interest_rate_time    hpi_time    
 Min.   :      0   Min.   :  0.00   Min.   : 0.000     Min.   :107.8  
 1st Qu.: 102017   1st Qu.: 67.11   1st Qu.: 5.650     1st Qu.:158.6  
 Median : 180618   Median : 82.25   Median : 6.625     Median :180.5  
 Mean   : 245965   Mean   : 83.08   Mean   : 6.702     Mean   :184.1  
 3rd Qu.: 337495   3rd Qu.:100.63   3rd Qu.: 7.875     3rd Qu.:212.7  
 Max.   :8701859   Max.   :803.51   Max.   :37.500     Max.   :226.3  
                   NA's   :270                                        
    gdp_time         uer_time      REtype_CO_orig_time REtype_PU_orig_time
 Min.   :-4.147   Min.   : 3.800   Min.   :0.0000      Min.   :0.0000     
 1st Qu.: 1.104   1st Qu.: 4.700   1st Qu.:0.0000      1st Qu.:0.0000     
 Median : 1.851   Median : 5.700   Median :0.0000      Median :0.0000     
 Mean   : 1.381   Mean   : 6.517   Mean   :0.0676      Mean   :0.1248     
 3rd Qu.: 2.694   3rd Qu.: 8.200   3rd Qu.:0.0000      3rd Qu.:0.0000     
 Max.   : 5.132   Max.   :10.000   Max.   :1.0000      Max.   :1.0000     
                                                                          
 REtype_SF_orig_time investor_orig_time balance_orig_time FICO_orig_time 
 Min.   :0.0000      Min.   :0.0000     Min.   :      0   Min.   :400.0  
 1st Qu.:0.0000      1st Qu.:0.0000     1st Qu.: 108000   1st Qu.:626.0  
 Median :1.0000      Median :0.0000     Median : 188000   Median :678.0  
 Mean   :0.6121      Mean   :0.1382     Mean   : 256254   Mean   :673.6  
 3rd Qu.:1.0000      3rd Qu.:0.0000     3rd Qu.: 352000   3rd Qu.:729.0  
 Max.   :1.0000      Max.   :1.0000     Max.   :8000000   Max.   :840.0  
                                                                         
 LTV_orig_time    Interest_Rate_orig_time hpi_orig_time   
 Min.   : 50.10   Min.   : 0.000          Min.   : 75.71  
 1st Qu.: 75.00   1st Qu.: 5.000          1st Qu.:179.45  
 Median : 80.00   Median : 6.290          Median :216.77  
 Mean   : 78.98   Mean   : 5.650          Mean   :198.12  
 3rd Qu.: 80.00   3rd Qu.: 7.456          3rd Qu.:222.39  
 Max.   :218.50   Max.   :19.750          Max.   :226.29  
                                                          
  default_time      payoff_time       status_time    
 Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
 Median :0.00000   Median :0.00000   Median :0.0000  
 Mean   :0.02435   Mean   :0.04271   Mean   :0.1098  
 3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000  
 Max.   :1.00000   Max.   :1.00000   Max.   :2.0000  
                                                     
str(mortgage1)
Classes 'tbl_df', 'tbl' and 'data.frame':   622489 obs. of  22 variables:
 $ time                   : num  25 26 27 28 29 30 31 32 33 34 ...
 $ orig_time              : num  -7 -7 -7 -7 -7 -7 -7 -7 -7 -7 ...
 $ first_time             : num  25 25 25 25 25 25 25 25 25 25 ...
 $ mat_time               : num  113 113 113 113 113 113 113 113 113 113 ...
 $ balance_time           : num  41303 41062 40804 40484 40367 ...
 $ LTV_time               : num  24.5 24.5 24.6 24.7 24.9 ...
 $ interest_rate_time     : num  9.2 9.2 9.2 9.2 9.2 9.2 9.2 9.2 9.2 9.2 ...
 $ hpi_time               : num  226 225 222 220 217 ...
 $ gdp_time               : num  2.9 2.15 2.36 1.23 1.69 ...
 $ uer_time               : num  4.7 4.7 4.4 4.6 4.5 4.7 4.7 5 5 5.8 ...
 $ REtype_CO_orig_time    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ REtype_PU_orig_time    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ REtype_SF_orig_time    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ investor_orig_time     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ balance_orig_time      : num  45000 45000 45000 45000 45000 45000 45000 45000 45000 45000 ...
 $ FICO_orig_time         : num  715 715 715 715 715 715 715 715 715 715 ...
 $ LTV_orig_time          : num  69.4 69.4 69.4 69.4 69.4 69.4 69.4 69.4 69.4 69.4 ...
 $ Interest_Rate_orig_time: num  9.2 9.2 9.2 9.2 9.2 9.2 9.2 9.2 9.2 9.2 ...
 $ hpi_orig_time          : num  87 87 87 87 87 ...
 $ default_time           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ payoff_time            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ status_time            : num  0 0 0 0 0 0 0 0 0 0 ...
mortgage1<-na.omit(mortgage1)
View(mortgage1)
There are 270 missing values, all in the variable LTV_time (see the summary above). They could be imputed with the mean or median, or the affected rows could be dropped; here we drop the rows with missing data.
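As an alternative to dropping the rows, a minimal sketch of median imputation for LTV_time is shown below. This is illustrative only and was not applied here; the object name mortgage1_imputed is hypothetical, and the step would replace the na.omit() call above rather than follow it.

# Alternative (not applied here): impute the 270 missing LTV_time values
# with the median instead of dropping the rows via na.omit().
ltv_median <- median(mortgage1$LTV_time, na.rm = TRUE)
mortgage1_imputed <- mutate(mortgage1,
                            LTV_time = ifelse(is.na(LTV_time), ltv_median, LTV_time))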

Check For Outliers

boxplot(mortgage1)

The boxplots show outliers in several variables. A few extreme outliers lie far from the rest and are removed, since they may distort model performance.

Remove Extreme Outliers

mortgage1<-mortgage1[!mortgage1$balance_orig_time>6e+06,]
mortgage1<-mortgage1[!mortgage1$LTV_time>400,]
mortgage1<-mortgage1[!mortgage1$interest_rate_time>20,]
gdpout<-boxplot(mortgage1$gdp_time)$out

mortgage1<-mortgage1[! mortgage1$gdp_time %in% c(gdpout),]
mortgage1<-mortgage1[!mortgage1$LTV_orig_time>120,]
mortgage1<-mortgage1[!mortgage1$Interest_Rate_orig_time>15,]
mortgage1<-mortgage1[!mortgage1$Interest_Rate_orig_time==0,]
mortgage1<-mortgage1[!mortgage1$balance_orig_time>2300000,]
dim(mortgage1)
[1] 468317     22
boxplot(mortgage1)

After removing these extreme outliers, the mortgage1 dataset is reduced to 468,317 observations of 22 variables.
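The cut-offs above were chosen by eye from the boxplots. A more systematic variant would flag values beyond a fixed number of interquartile ranges from the quartiles; the sketch below illustrates the idea (the helper drop_extreme and the object mortgage_alt are hypothetical names, and this rule was not the one applied above).

# Sketch of a rule-based alternative to the hand-picked cut-offs:
# keep only rows whose value lies within k IQRs of the quartiles.
drop_extreme <- function(df, col, k = 3) {
  q   <- quantile(df[[col]], probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  df[df[[col]] >= q[[1]] - k * iqr & df[[col]] <= q[[2]] + k * iqr, ]
}
# e.g. mortgage_alt <- drop_extreme(mortgage1, "balance_orig_time")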

Next, the mortgage1 dataset is split into one training set and five test sets (the data is large enough to support several held-out sets).

RESAMPLING THE DATA

library(caret)
Loading required package: lattice
Loading required package: ggplot2
splitdata<-createDataPartition(mortgage1$status_time,p=0.30,list = FALSE)
train<-mortgage1[splitdata,]
data1<-mortgage1[-splitdata,]

splitdata1<-createDataPartition(data1$status_time,p=0.20,list = FALSE)
test1<-data1[splitdata1,]
data2<-data1[-splitdata1,]

splitdata2<-createDataPartition(data2$status_time,p=0.25,list = FALSE)
test2<-data2[splitdata2,]
data3<-data2[-splitdata2,]

splitdata3<-createDataPartition(data3$status_time,p=0.30,list = FALSE)
test3<-data3[splitdata3,]
data4<-data3[-splitdata3,]

splitdata4<-createDataPartition(data4$status_time,p=0.40,list = FALSE)
test4<-data4[splitdata4,]
test5<-data4[-splitdata4,]


dim(train)
[1] 140496     22
dim(test1)
[1] 65565    22
dim(test2)
[1] 65564    22
dim(test3)
[1] 59008    22
dim(test4)
[1] 55074    22
dim(test5)
[1] 82610    22
The data is split into a training set with 140,496 observations and test sets test1 through test5 with 65,565, 65,564, 59,008, 55,074, and 82,610 observations respectively.
Next, several models are fitted on the training data and the best one is selected by comparing their performance.
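Since createDataPartition() groups a numeric outcome into quantile bins before sampling, it is worth confirming that each split keeps roughly the same class mix of status_time. A quick check one might run (the list name splits is illustrative):

# Compare the status_time class proportions across the training and test splits.
splits <- list(train = train, test1 = test1, test2 = test2,
               test3 = test3, test4 = test4, test5 = test5)
round(sapply(splits, function(d) prop.table(table(d$status_time))), 3)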

MULTINOMIAL LOGISTIC REGRESSION

MODEL BUILDING

library(nnet)
model_log<-multinom(status_time ~ ., data = train)
# weights:  69 (44 variable)
initial  value 154350.632109 
iter  10 value 73758.982173
iter  20 value 67036.810340
iter  30 value 24552.344610
iter  40 value 13976.641106
iter  50 value 131.795275
iter  60 value 18.496431
iter  70 value 5.625945
iter  80 value 3.239375
iter  90 value 2.125496
iter 100 value 0.900931
final  value 0.900931 
stopped after 100 iterations
model_log
Call:
multinom(formula = status_time ~ ., data = train)

Coefficients:
  (Intercept)         time   orig_time  first_time   mat_time balance_time
1   -7.182890 -0.016759495 0.045898631 -0.07625945 0.04611751 2.890578e-05
2   -7.995852 -0.002865867 0.006291137  0.01074590 0.01400345 1.240231e-06
     LTV_time interest_rate_time    hpi_time     gdp_time    uer_time
1  0.01432979        -0.33483585 -0.04852568  0.164245195 -0.09971282
2 -0.01741873        -0.00493687 -0.01018085 -0.001638399 -0.13768944
  REtype_CO_orig_time REtype_PU_orig_time REtype_SF_orig_time
1          -1.1961216          -0.1472076         -0.08401662
2           0.3241847           0.4445232          0.05584970
  investor_orig_time balance_orig_time FICO_orig_time LTV_orig_time
1       -0.774734844     -4.001732e-05    0.007138301  -0.037759716
2       -0.003901038     -2.857548e-06   -0.001320203  -0.009110824
  Interest_Rate_orig_time hpi_orig_time default_time payoff_time
1             -0.21944120 -0.0122722276     47.69307    9.455724
2             -0.02074202 -0.0003841193     14.15963   23.397839

Residual Deviance: 1.801862 
AIC: 89.80186 
The multinomial logistic regression ran for the default maximum of 100 iterations, with the objective (the negative log-likelihood) falling from about 154,351 to 0.90 before the iteration limit was reached; the intercepts and slopes for each non-reference class (1 = default, 2 = payoff) are shown above.
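Two follow-ups one might run on this fit, sketched below: raising the iteration cap so the optimizer stops on its own convergence criterion rather than on maxit = 100, and Wald z-tests for the coefficients. The object names model_log2, s, z, and p_values are illustrative, and note that with the near-perfect separation in this data the standard errors may not be meaningful.

# Refit with a higher iteration cap; trace = FALSE suppresses the iteration log.
model_log2 <- multinom(status_time ~ ., data = train, maxit = 500, trace = FALSE)

# Wald z-tests: summary() returns matrices of coefficients and standard errors,
# one row per non-reference class of status_time.
s <- summary(model_log2)
z <- s$coefficients / s$standard.errors
p_values <- 2 * (1 - pnorm(abs(z)))
round(p_values, 4)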

PREDICT ON TRAIN DATA

pred_train<-predict(model_log,train,type="class")
confusionMatrix(as.factor(train$status_time),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction      0      1      2
         0 132106      0      0
         1      0   2737      0
         2      0      0   5653

Overall Statistics
                                   
               Accuracy : 1        
                 95% CI : (1, 1)   
    No Information Rate : 0.9403   
    P-Value [Acc > NIR] : < 2.2e-16
                                   
                  Kappa : 1        
 Mcnemar's Test P-Value : NA       

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            1.0000  1.00000  1.00000
Specificity            1.0000  1.00000  1.00000
Pos Pred Value         1.0000  1.00000  1.00000
Neg Pred Value         1.0000  1.00000  1.00000
Prevalence             0.9403  0.01948  0.04024
Detection Rate         0.9403  0.01948  0.04024
Detection Prevalence   0.9403  0.01948  0.04024
Balanced Accuracy      1.0000  1.00000  1.00000
library(gmodels)
CrossTable(pred_train, train$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  140496 

 
                 | actual values 
predicted values |         0 |         1 |         2 | Row Total | 
-----------------|-----------|-----------|-----------|-----------|
               0 |    132106 |         0 |         0 |    132106 | 
                 |     0.940 |     0.000 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               1 |         0 |      2737 |         0 |      2737 | 
                 |     0.000 |     0.019 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               2 |         0 |         0 |      5653 |      5653 | 
                 |     0.000 |     0.000 |     0.040 |           | 
-----------------|-----------|-----------|-----------|-----------|
    Column Total |    132106 |      2737 |      5653 |    140496 | 
-----------------|-----------|-----------|-----------|-----------|

 
The model predicts every training observation correctly, giving perfect accuracy, as shown in the confusion matrix and cross table above.

CART DECISION TREE

MODEL BUILDING

library(rpart)
library(rpart.plot)
model_cart<-rpart(status_time ~ ., data=train)
model_cart$cptable
         CP nsplit rel error    xerror        xstd
1 0.8880182      0 1.0000000 1.0000096 0.011159628
2 0.1119818      1 0.1119818 0.1119827 0.002075663
3 0.0100000      2 0.0000000 0.0000000 0.000000000
model_cart$variable.importance
 payoff_time default_time     LTV_time balance_time 
  21263.9144    2681.4453     210.6455     206.8840 
rpart.plot(model_cart)

The complexity-parameter table shows that two splits are enough to drive the relative error to zero: the first split (CP ≈ 0.888) and the second (CP ≈ 0.112) remove essentially all of the error, and the tree stops at the default cp floor of 0.01.
The variable importance is dominated by payoff_time and default_time, and the rpart plot shows payoff_time at the root node followed by binary splits.
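If pruning were needed, the cptable could be used to pick the complexity parameter with the lowest cross-validated error, as sketched below (the names best_cp and model_cart_pruned are illustrative; here the full tree already reaches zero xerror, so pruning would change nothing).

# Select the CP with the lowest cross-validated error and prune to it.
best_cp <- model_cart$cptable[which.min(model_cart$cptable[, "xerror"]), "CP"]
model_cart_pruned <- prune(model_cart, cp = best_cp)
rpart.plot(model_cart_pruned)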

PREDICT ON TRAIN DATA

pred_train<-predict(model_cart,train)
confusionMatrix(as.factor(train$status_time),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction      0      1      2
         0 132106      0      0
         1      0   2737      0
         2      0      0   5653

Overall Statistics
                                   
               Accuracy : 1        
                 95% CI : (1, 1)   
    No Information Rate : 0.9403   
    P-Value [Acc > NIR] : < 2.2e-16
                                   
                  Kappa : 1        
 Mcnemar's Test P-Value : NA       

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            1.0000  1.00000  1.00000
Specificity            1.0000  1.00000  1.00000
Pos Pred Value         1.0000  1.00000  1.00000
Neg Pred Value         1.0000  1.00000  1.00000
Prevalence             0.9403  0.01948  0.04024
Detection Rate         0.9403  0.01948  0.04024
Detection Prevalence   0.9403  0.01948  0.04024
Balanced Accuracy      1.0000  1.00000  1.00000
library(gmodels)
CrossTable(pred_train, train$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  140496 

 
                 | actual values 
predicted values |         0 |         1 |         2 | Row Total | 
-----------------|-----------|-----------|-----------|-----------|
               0 |    132106 |         0 |         0 |    132106 | 
                 |     0.940 |     0.000 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               1 |         0 |      2737 |         0 |      2737 | 
                 |     0.000 |     0.019 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               2 |         0 |         0 |      5653 |      5653 | 
                 |     0.000 |     0.000 |     0.040 |           | 
-----------------|-----------|-----------|-----------|-----------|
    Column Total |    132106 |      2737 |      5653 |    140496 | 
-----------------|-----------|-----------|-----------|-----------|

 
Here too the model achieves perfect accuracy, predicting every training observation correctly.

EXTREME GRADIENT BOOSTING (XGBOOST)

MODEL BUILDING

library(xgboost)

Attaching package: 'xgboost'
The following object is masked from 'package:dplyr':

    slice
x<-train[,1:21]
y<-train[,22]
model_xgb<-xgboost(data=as.matrix(x),label = as.matrix(y),nrounds = 100)
[1] train-rmse:0.402417 
[2] train-rmse:0.281699 
[3] train-rmse:0.197194 
[4] train-rmse:0.138040 
[5] train-rmse:0.096630 
[6] train-rmse:0.067643 
[7] train-rmse:0.047351 
[8] train-rmse:0.033147 
[9] train-rmse:0.023203 
[10]    train-rmse:0.016243 
[11]    train-rmse:0.011370 
[12]    train-rmse:0.007959 
[13]    train-rmse:0.005572 
[14]    train-rmse:0.003900 
[15]    train-rmse:0.002730 
[16]    train-rmse:0.001911 
[17]    train-rmse:0.001338 
[18]    train-rmse:0.000937 
[19]    train-rmse:0.000656 
[20]    train-rmse:0.000459 
[21]    train-rmse:0.000321 
[22]    train-rmse:0.000225 
[23]    train-rmse:0.000157 
[24]    train-rmse:0.000110 
[25]    train-rmse:0.000077 
[26]    train-rmse:0.000054 
[27]    train-rmse:0.000038 
[28]    train-rmse:0.000026 
[29]    train-rmse:0.000019 
[30]    train-rmse:0.000013 
[31]    train-rmse:0.000009 
[32]    train-rmse:0.000007 
[33]    train-rmse:0.000005 
[34]    train-rmse:0.000004 
[35]    train-rmse:0.000003 
[36]    train-rmse:0.000003 
[37]-[99]   train-rmse:0.000003 (unchanged through these rounds)
[100]   train-rmse:0.000003 
The model runs 100 boosting rounds; each round adds a tree fitted to the residuals of the previous rounds, and the training root mean squared error is reported after every round. With the default objective, xgboost treats status_time as a numeric target (squared-error regression), which is why the predictions are rounded to the nearest class below.
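A sketch of how the same fit could instead be set up as an explicit 3-class model, using the standard xgboost parameters objective = "multi:softmax" and num_class; this is an alternative, not what was run above, and model_xgb_mc / pred_class are illustrative names.

# Alternative: treat status_time as a 3-class target instead of a number.
model_xgb_mc <- xgboost(data = as.matrix(x), label = as.matrix(y), nrounds = 100,
                        objective = "multi:softmax", num_class = 3, verbose = 0)
pred_class <- predict(model_xgb_mc, as.matrix(x))  # returns class labels 0, 1, 2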

PREDICT ON TRAIN DATA

pred<-predict(model_xgb,as.matrix(x))
pred_train<-round(pred)
confusionMatrix(as.factor(train$status_time),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction      0      1      2
         0 132106      0      0
         1      0   2737      0
         2      0      0   5653

Overall Statistics
                                   
               Accuracy : 1        
                 95% CI : (1, 1)   
    No Information Rate : 0.9403   
    P-Value [Acc > NIR] : < 2.2e-16
                                   
                  Kappa : 1        
 Mcnemar's Test P-Value : NA       

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            1.0000  1.00000  1.00000
Specificity            1.0000  1.00000  1.00000
Pos Pred Value         1.0000  1.00000  1.00000
Neg Pred Value         1.0000  1.00000  1.00000
Prevalence             0.9403  0.01948  0.04024
Detection Rate         0.9403  0.01948  0.04024
Detection Prevalence   0.9403  0.01948  0.04024
Balanced Accuracy      1.0000  1.00000  1.00000
library(gmodels)
CrossTable(pred_train, train$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  140496 

 
                 | actual values 
predicted values |         0 |         1 |         2 | Row Total | 
-----------------|-----------|-----------|-----------|-----------|
               0 |    132106 |         0 |         0 |    132106 | 
                 |     0.940 |     0.000 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               1 |         0 |      2737 |         0 |      2737 | 
                 |     0.000 |     0.019 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               2 |         0 |         0 |      5653 |      5653 | 
                 |     0.000 |     0.000 |     0.040 |           | 
-----------------|-----------|-----------|-----------|-----------|
    Column Total |    132106 |      2737 |      5653 |    140496 | 
-----------------|-----------|-----------|-----------|-----------|

 
The extreme gradient boosting model likewise predicts all training observations correctly.

All three models (multinomial logistic regression, the CART decision tree, and extreme gradient boosting) fit the training data perfectly, so any of them could be carried forward; we choose the multinomial logistic regression to predict the outcome on the different test sets.
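Rather than repeating the prediction block five times, the same checks could be run in one pass, as sketched below (the list name test_sets is illustrative; the full per-set output is still shown afterwards).

# Evaluate the multinomial model on each test set and report its accuracy.
test_sets <- list(test1 = test1, test2 = test2, test3 = test3,
                  test4 = test4, test5 = test5)
sapply(test_sets, function(d) {
  pred <- predict(model_log, d, type = "class")
  mean(as.character(pred) == as.character(d$status_time))
})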

PREDICT ON THE DIFFERENT TEST DATASETS

USING MULTINOMIAL LOGISTIC REGRESSION

PREDICT ON TEST1 DATA

pred_test1<-predict(model_log,test1,type="class")
confusionMatrix(as.factor(test1$status_time),as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction     0     1     2
         0 61554     0     0
         1     0  1325     0
         2     0     0  2686

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9999, 1)
    No Information Rate : 0.9388     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            1.0000  1.00000  1.00000
Specificity            1.0000  1.00000  1.00000
Pos Pred Value         1.0000  1.00000  1.00000
Neg Pred Value         1.0000  1.00000  1.00000
Prevalence             0.9388  0.02021  0.04097
Detection Rate         0.9388  0.02021  0.04097
Detection Prevalence   0.9388  0.02021  0.04097
Balanced Accuracy      1.0000  1.00000  1.00000
library(gmodels)
CrossTable(pred_test1, test1$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  65565 

 
                 | actual values 
predicted values |         0 |         1 |         2 | Row Total | 
-----------------|-----------|-----------|-----------|-----------|
               0 |     61554 |         0 |         0 |     61554 | 
                 |     0.939 |     0.000 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               1 |         0 |      1325 |         0 |      1325 | 
                 |     0.000 |     0.020 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               2 |         0 |         0 |      2686 |      2686 | 
                 |     0.000 |     0.000 |     0.041 |           | 
-----------------|-----------|-----------|-----------|-----------|
    Column Total |     61554 |      1325 |      2686 |     65565 | 
-----------------|-----------|-----------|-----------|-----------|

 

PREDICT ON TEST2 DATA

pred_test2<-predict(model_log,test2,type="class")
confusionMatrix(as.factor(test2$status_time),as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction     0     1     2
         0 61547     0     0
         1     0  1372     0
         2     0     0  2645

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9999, 1)
    No Information Rate : 0.9387     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            1.0000  1.00000  1.00000
Specificity            1.0000  1.00000  1.00000
Pos Pred Value         1.0000  1.00000  1.00000
Neg Pred Value         1.0000  1.00000  1.00000
Prevalence             0.9387  0.02093  0.04034
Detection Rate         0.9387  0.02093  0.04034
Detection Prevalence   0.9387  0.02093  0.04034
Balanced Accuracy      1.0000  1.00000  1.00000
library(gmodels)
CrossTable(pred_test2, test2$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  65564 

 
                 | actual values 
predicted values |         0 |         1 |         2 | Row Total | 
-----------------|-----------|-----------|-----------|-----------|
               0 |     61547 |         0 |         0 |     61547 | 
                 |     0.939 |     0.000 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               1 |         0 |      1372 |         0 |      1372 | 
                 |     0.000 |     0.021 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               2 |         0 |         0 |      2645 |      2645 | 
                 |     0.000 |     0.000 |     0.040 |           | 
-----------------|-----------|-----------|-----------|-----------|
    Column Total |     61547 |      1372 |      2645 |     65564 | 
-----------------|-----------|-----------|-----------|-----------|

 

PREDICT ON TEST3 DATA

pred_test3<-predict(model_log,test3,type="class")
confusionMatrix(as.factor(test3$status_time),as.factor(pred_test3))
Confusion Matrix and Statistics

          Reference
Prediction     0     1     2
         0 55379     0     0
         1     0  1241     0
         2     0     0  2388

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9999, 1)
    No Information Rate : 0.9385     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            1.0000  1.00000  1.00000
Specificity            1.0000  1.00000  1.00000
Pos Pred Value         1.0000  1.00000  1.00000
Neg Pred Value         1.0000  1.00000  1.00000
Prevalence             0.9385  0.02103  0.04047
Detection Rate         0.9385  0.02103  0.04047
Detection Prevalence   0.9385  0.02103  0.04047
Balanced Accuracy      1.0000  1.00000  1.00000
library(gmodels)
CrossTable(pred_test3, test3$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  59008 

 
                 | actual values 
predicted values |         0 |         1 |         2 | Row Total | 
-----------------|-----------|-----------|-----------|-----------|
               0 |     55379 |         0 |         0 |     55379 | 
                 |     0.938 |     0.000 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               1 |         0 |      1241 |         0 |      1241 | 
                 |     0.000 |     0.021 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               2 |         0 |         0 |      2388 |      2388 | 
                 |     0.000 |     0.000 |     0.040 |           | 
-----------------|-----------|-----------|-----------|-----------|
    Column Total |     55379 |      1241 |      2388 |     59008 | 
-----------------|-----------|-----------|-----------|-----------|

 

PREDICT ON TEST4 DATA

pred_test4<-predict(model_log,test4,type="class")
confusionMatrix(as.factor(test4$status_time),as.factor(pred_test4))
Confusion Matrix and Statistics

          Reference
Prediction     0     1     2
         0 51639     0     0
         1     0  1207     0
         2     0     0  2228

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9999, 1)
    No Information Rate : 0.9376     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            1.0000  1.00000  1.00000
Specificity            1.0000  1.00000  1.00000
Pos Pred Value         1.0000  1.00000  1.00000
Neg Pred Value         1.0000  1.00000  1.00000
Prevalence             0.9376  0.02192  0.04045
Detection Rate         0.9376  0.02192  0.04045
Detection Prevalence   0.9376  0.02192  0.04045
Balanced Accuracy      1.0000  1.00000  1.00000
library(gmodels)
CrossTable(pred_test4, test4$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  55074 

 
                 | actual values 
predicted values |         0 |         1 |         2 | Row Total | 
-----------------|-----------|-----------|-----------|-----------|
               0 |     51639 |         0 |         0 |     51639 | 
                 |     0.938 |     0.000 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               1 |         0 |      1207 |         0 |      1207 | 
                 |     0.000 |     0.022 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               2 |         0 |         0 |      2228 |      2228 | 
                 |     0.000 |     0.000 |     0.040 |           | 
-----------------|-----------|-----------|-----------|-----------|
    Column Total |     51639 |      1207 |      2228 |     55074 | 
-----------------|-----------|-----------|-----------|-----------|

 

PREDICT ON TEST5 DATA

pred_test5<-predict(model_log,test5,type="class")
confusionMatrix(as.factor(test5$status_time),as.factor(pred_test5))
Confusion Matrix and Statistics

          Reference
Prediction     0     1     2
         0 77499     0     0
         1     0  1724     0
         2     0     0  3387

Overall Statistics
                                   
               Accuracy : 1        
                 95% CI : (1, 1)   
    No Information Rate : 0.9381   
    P-Value [Acc > NIR] : < 2.2e-16
                                   
                  Kappa : 1        
 Mcnemar's Test P-Value : NA       

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            1.0000  1.00000    1.000
Specificity            1.0000  1.00000    1.000
Pos Pred Value         1.0000  1.00000    1.000
Neg Pred Value         1.0000  1.00000    1.000
Prevalence             0.9381  0.02087    0.041
Detection Rate         0.9381  0.02087    0.041
Detection Prevalence   0.9381  0.02087    0.041
Balanced Accuracy      1.0000  1.00000    1.000
library(gmodels)
CrossTable(pred_test5, test5$status_time, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('predicted values','actual values'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  82610 

 
                 | actual values 
predicted values |         0 |         1 |         2 | Row Total | 
-----------------|-----------|-----------|-----------|-----------|
               0 |     77499 |         0 |         0 |     77499 | 
                 |     0.938 |     0.000 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               1 |         0 |      1724 |         0 |      1724 | 
                 |     0.000 |     0.021 |     0.000 |           | 
-----------------|-----------|-----------|-----------|-----------|
               2 |         0 |         0 |      3387 |      3387 | 
                 |     0.000 |     0.000 |     0.041 |           | 
-----------------|-----------|-----------|-----------|-----------|
    Column Total |     77499 |      1724 |      3387 |     82610 | 
-----------------|-----------|-----------|-----------|-----------|

 
The multinomial logistic regression model also predicts every observation correctly on each of the five test sets.

CONCLUSION

All the models report perfect metrics (accuracy, kappa, sensitivity, specificity, and balanced accuracy are 100%) on the training data and on every test set. Results this clean are a warning sign rather than a success: status_time is defined directly from default_time and payoff_time, both of which were kept as explanatory variables (and dominate the CART variable importance), so the models are largely reproducing the label rather than learning from borrower and macroeconomic characteristics. The multinomial logistic regression model is therefore prone to overfitting this construction and cannot be relied on to perform as well on future data where that information is unavailable; it should be re-evaluated with default_time and payoff_time excluded and tested on additional datasets of this kind.
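A quick check one might run, sketched under the assumption above that default_time and payoff_time drive the perfect scores (model_log_noleak and pred_noleak are illustrative names): refit the model without those two variables and compare its accuracy on a held-out set.

# Refit without the two flags that define status_time and re-check accuracy on test1.
model_log_noleak <- multinom(status_time ~ . - default_time - payoff_time,
                             data = train, maxit = 500, trace = FALSE)
pred_noleak <- predict(model_log_noleak, test1, type = "class")
mean(as.character(pred_noleak) == as.character(test1$status_time))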