library(readxl)
## Warning: package 'readxl' was built under R version 3.5.3
ctg <- read_xlsx("F:/Machine Learning/Data Science/Machine Learning/Ordinal/CTG.xlsx")
head(ctg)
## # A tibble: 6 x 22
## LB AC FM UC DL DS DP ASTV MSTV ALTV MLTV
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 120 0 0 0 0 0 0 73 0.5 43 2.4
## 2 132 0.00638 0 0.00638 0.00319 0 0 17 2.1 0 10.4
## 3 133 0.00332 0 0.00831 0.00332 0 0 16 2.1 0 13.4
## 4 134 0.00256 0 0.00768 0.00256 0 0 16 2.4 0 23
## 5 132 0.00651 0 0.00814 0 0 0 16 2.4 0 19.9
## 6 134 0.00105 0 0.0105 0.00944 0 0.00210 26 5.9 0 0
## # ... with 11 more variables: Width <dbl>, Min <dbl>, Max <dbl>,
## # Nmax <dbl>, Nzeros <dbl>, Mode <dbl>, Mean <dbl>, Median <dbl>,
## # Variance <dbl>, Tendency <dbl>, NSP <dbl>
str(ctg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2126 obs. of 22 variables:
## $ LB : num 120 132 133 134 132 134 134 122 122 122 ...
## $ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
## $ FM : num 0 0 0 0 0 0 0 0 0 0 ...
## $ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
## $ DL : num 0 0.00319 0.00332 0.00256 0 ...
## $ DS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DP : num 0 0 0 0 0 ...
## $ ASTV : num 73 17 16 16 16 26 29 83 84 86 ...
## $ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
## $ ALTV : num 43 0 0 0 0 0 0 6 5 6 ...
## $ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
## $ Width : num 64 130 130 117 117 150 150 68 68 68 ...
## $ Min : num 62 68 68 53 53 50 50 62 62 62 ...
## $ Max : num 126 198 198 170 170 200 200 130 130 130 ...
## $ Nmax : num 2 6 5 11 9 5 6 0 0 1 ...
## $ Nzeros : num 0 1 1 0 0 3 3 0 0 0 ...
## $ Mode : num 120 141 141 137 137 76 71 122 122 122 ...
## $ Mean : num 137 136 135 134 136 107 107 122 122 122 ...
## $ Median : num 121 140 138 137 138 107 106 123 123 123 ...
## $ Variance: num 73 12 13 13 11 170 215 3 3 1 ...
## $ Tendency: num 1 0 0 1 1 0 0 1 1 1 ...
## $ NSP : num 2 1 1 1 1 3 3 3 3 3 ...
The target variable NSP has three classes: 1 means normal, 2 means suspect, and 3 means pathologic.
#convert NSP into an ordinal variable
ctg$NSP <- as.ordered(ctg$NSP)
str(ctg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2126 obs. of 22 variables:
## $ LB : num 120 132 133 134 132 134 134 122 122 122 ...
## $ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
## $ FM : num 0 0 0 0 0 0 0 0 0 0 ...
## $ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
## $ DL : num 0 0.00319 0.00332 0.00256 0 ...
## $ DS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DP : num 0 0 0 0 0 ...
## $ ASTV : num 73 17 16 16 16 26 29 83 84 86 ...
## $ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
## $ ALTV : num 43 0 0 0 0 0 0 6 5 6 ...
## $ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
## $ Width : num 64 130 130 117 117 150 150 68 68 68 ...
## $ Min : num 62 68 68 53 53 50 50 62 62 62 ...
## $ Max : num 126 198 198 170 170 200 200 130 130 130 ...
## $ Nmax : num 2 6 5 11 9 5 6 0 0 1 ...
## $ Nzeros : num 0 1 1 0 0 3 3 0 0 0 ...
## $ Mode : num 120 141 141 137 137 76 71 122 122 122 ...
## $ Mean : num 137 136 135 134 136 107 107 122 122 122 ...
## $ Median : num 121 140 138 137 138 107 106 123 123 123 ...
## $ Variance: num 73 12 13 13 11 170 215 3 3 1 ...
## $ Tendency: num 1 0 0 1 1 0 0 1 1 1 ...
## $ NSP : Ord.factor w/ 3 levels "1"<"2"<"3": 2 1 1 1 1 3 3 3 3 3 ...
The NSP variable is now an ordered factor with 1 < 2 < 3, where 3 (pathologic) is the worst condition.
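As a quick aside (illustrative, not part of the original analysis), ordered factors support inequality comparisons between levels, which is exactly what the ordinal model relies on:
#ordered factor levels can be compared
f <- factor(c(1, 3, 2), levels = 1:3, ordered = TRUE)
f[1] < f[2] #TRUE, because 1 < 3 in the declared ordering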
summary(ctg)
## LB AC FM UC
## Min. :106.0 Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:126.0 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.001876
## Median :133.0 Median :0.001630 Median :0.000000 Median :0.004482
## Mean :133.3 Mean :0.003170 Mean :0.009474 Mean :0.004357
## 3rd Qu.:140.0 3rd Qu.:0.005631 3rd Qu.:0.002512 3rd Qu.:0.006525
## Max. :160.0 Max. :0.019284 Max. :0.480634 Max. :0.014925
## DL DS DP
## Min. :0.000000 Min. :0.000e+00 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.000e+00 1st Qu.:0.0000000
## Median :0.000000 Median :0.000e+00 Median :0.0000000
## Mean :0.001885 Mean :3.585e-06 Mean :0.0001566
## 3rd Qu.:0.003264 3rd Qu.:0.000e+00 3rd Qu.:0.0000000
## Max. :0.015385 Max. :1.353e-03 Max. :0.0053476
## ASTV MSTV ALTV MLTV
## Min. :12.00 Min. :0.200 Min. : 0.000 Min. : 0.000
## 1st Qu.:32.00 1st Qu.:0.700 1st Qu.: 0.000 1st Qu.: 4.600
## Median :49.00 Median :1.200 Median : 0.000 Median : 7.400
## Mean :46.99 Mean :1.333 Mean : 9.847 Mean : 8.188
## 3rd Qu.:61.00 3rd Qu.:1.700 3rd Qu.:11.000 3rd Qu.:10.800
## Max. :87.00 Max. :7.000 Max. :91.000 Max. :50.700
## Width Min Max Nmax
## Min. : 3.00 Min. : 50.00 Min. :122 Min. : 0.000
## 1st Qu.: 37.00 1st Qu.: 67.00 1st Qu.:152 1st Qu.: 2.000
## Median : 67.50 Median : 93.00 Median :162 Median : 3.000
## Mean : 70.45 Mean : 93.58 Mean :164 Mean : 4.068
## 3rd Qu.:100.00 3rd Qu.:120.00 3rd Qu.:174 3rd Qu.: 6.000
## Max. :180.00 Max. :159.00 Max. :238 Max. :18.000
## Nzeros Mode Mean Median
## Min. : 0.0000 Min. : 60.0 Min. : 73.0 Min. : 77.0
## 1st Qu.: 0.0000 1st Qu.:129.0 1st Qu.:125.0 1st Qu.:129.0
## Median : 0.0000 Median :139.0 Median :136.0 Median :139.0
## Mean : 0.3236 Mean :137.5 Mean :134.6 Mean :138.1
## 3rd Qu.: 0.0000 3rd Qu.:148.0 3rd Qu.:145.0 3rd Qu.:148.0
## Max. :10.0000 Max. :187.0 Max. :182.0 Max. :186.0
## Variance Tendency NSP
## Min. : 0.00 Min. :-1.0000 1:1655
## 1st Qu.: 2.00 1st Qu.: 0.0000 2: 295
## Median : 7.00 Median : 0.0000 3: 176
## Mean : 18.81 Mean : 0.3203
## 3rd Qu.: 24.00 3rd Qu.: 1.0000
## Max. :269.00 Max. : 1.0000
Out of the 21 predictor variables, Tendency takes only three values: -1, 0, and 1.
So we also convert Tendency into a categorical variable.
#convert Tendency into a categorical variable
ctg$Tendency <- as.factor(ctg$Tendency)
Now we cross-tabulate NSP with Tendency.
#cross-tabulation
xtabs(~NSP + Tendency, ctg)
## Tendency
## NSP -1 0 1
## 1 101 887 667
## 2 15 137 143
## 3 49 91 36
The cross-tabulation shows that every cell has a sufficient frequency for the analysis.
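To see how NSP distributes within each Tendency level, column-wise proportions can be computed as well (a small sketch, not in the original output):
#proportion of each NSP class within each Tendency level
prop.table(xtabs(~NSP + Tendency, ctg), margin = 2)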
We now partition the data into two parts: roughly 80% for training and 20% for testing.
#Data partition
#note: fixing the RNG seed first (e.g. set.seed(123)) would make this split reproducible
ind <- sample(2, nrow(ctg), replace = TRUE, prob = c(.8,.2))
ctg.train <- ctg[ind==1, ]
ctg.test <- ctg[ind==2, ]
dim(ctg.train)
## [1] 1673 22
dim(ctg.test)
## [1] 453 22
library(MASS)
#polr stands for proportional odds logistic regression
#building with only three variables just to illustrate
#Hess = TRUE stores the Hessian so that summary() can compute standard errors
train.m <- polr(NSP ~ LB+AC+FM, ctg.train, Hess = TRUE)
summary(train.m)
## Call:
## polr(formula = NSP ~ LB + AC + FM, data = ctg.train, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## LB 0.03903 0.006717 5.811e+00
## AC -821.35554 0.004881 -1.683e+05
## FM 7.03879 1.242086 5.667e+00
##
## Intercepts:
## Value Std. Error t value
## 1|2 5.6165 0.9217 6.0933
## 2|3 6.9965 0.9325 7.5027
##
## Residual Deviance: 1740.303
## AIC: 1750.303
Interpretation of the coefficient for FM: for a 1-unit increase in FM, we expect about a 7.039 increase in the expected value of NSP on the log-odds scale, holding all other variables in the model constant.
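Equivalently (an illustrative addition, not in the original output), exponentiating the coefficient expresses the effect on the odds scale:
#odds-ratio view: a 1-unit increase in FM multiplies the odds of being
#in a higher NSP category by exp(b_FM)
exp(coef(train.m)["FM"])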
#calculate probabilities
p <- predict(train.m, ctg.train, type = "probs")
head(p)
## 1 2 3
## 1 0.9966799 0.002482844 0.0008373026
## 2 0.9590754 0.030304224 0.0106203572
## 3 0.9234234 0.056141683 0.0204349353
## 4 0.9970275 0.002223091 0.0007494434
## 5 0.7770082 0.155657845 0.0673339542
## 6 0.8232355 0.125517449 0.0512470313
The summary also shows standard errors and t-values; the only thing still missing is the p-values.
#calculating p-values
#coef applied to the model summary returns the coefficient table
(ctable <- coef(summary(train.m)))
## Value Std. Error t value
## LB 0.03903007 0.006716933 5.810698e+00
## AC -821.35553742 0.004881380 -1.682630e+05
## FM 7.03879328 1.242086196 5.666912e+00
## 1|2 5.61648231 0.921747160 6.093300e+00
## 2|3 6.99654897 0.932539248 7.502686e+00
#compute two-sided p-values from the t values using the normal approximation
p <- pnorm(abs(ctable[ , "t value"]), lower.tail = FALSE) * 2
#ctable with p-values
(ctable <- cbind(ctable, "p value"=p))
## Value Std. Error t value p value
## LB 0.03903007 0.006716933 5.810698e+00 6.221296e-09
## AC -821.35553742 0.004881380 -1.682630e+05 0.000000e+00
## FM 7.03879328 1.242086196 5.666912e+00 1.453939e-08
## 1|2 5.61648231 0.921747160 6.093300e+00 1.106062e-09
## 2|3 6.99654897 0.932539248 7.502686e+00 6.252330e-14
We may notice that all three variables - LB, AC, and FM - are statistically significant because the p-values are very small.
For LB, the p-value is about 6.2e-09, which is statistically significant.
For AC, the p-value is essentially zero (the extremely large coefficient and t value for AC reflect the very small scale of that variable; rescaling such predictors is often advisable).
For FM, the p-value is about 1.5e-08, which is statistically significant.
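If confidence intervals are wanted too, approximate 95% Wald intervals can be built from the same table (a sketch; MASS also provides profile-likelihood intervals via confint):
#approximate 95% Wald confidence intervals from the coefficient table
ci <- ctable[, "Value"] + outer(ctable[, "Std. Error"], qnorm(c(0.025, 0.975)))
colnames(ci) <- c("2.5 %", "97.5 %")
ci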
#make predictions on the first five test rows using predict
pred <- predict(train.m, ctg.test[1:5, ], type = 'probs')
print(pred, digits = 5)
## 1 2 3
## 1 0.71766 0.192285 0.090057
## 2 0.70158 0.201762 0.096662
## 3 0.80081 0.140301 0.058887
## 4 0.44255 0.316823 0.240625
## 5 0.88779 0.081396 0.030817
These predictions are computed directly from the fitted coefficients and intercepts, as shown next.
#Equations for calculating the probabilities
ctable
## Value Std. Error t value p value
## LB 0.03903007 0.006716933 5.810698e+00 6.221296e-09
## AC -821.35553742 0.004881380 -1.682630e+05 0.000000e+00
## FM 7.03879328 1.242086196 5.666912e+00 1.453939e-08
## 1|2 5.61648231 0.921747160 6.093300e+00 1.106062e-09
## 2|3 6.99654897 0.932539248 7.502686e+00 6.252330e-14
Reading the fitted values off the table: b1 = 0.03903 (LB), b2 = -821.35554 (AC), b3 = 7.03879 (FM), alpha1 = 5.61648 (intercept 1|2), and alpha2 = 6.99655 (intercept 2|3).
The proportional odds model first forms the linear predictor
eta = b1*LB + b2*AC + b3*FM
and then the cumulative probabilities
logit[P(NSP <= j)] = alpha_j - eta, for j = 1, 2
so that P(NSP = 1) = P(NSP <= 1), P(NSP = 2) = P(NSP <= 2) - P(NSP <= 1), and P(NSP = 3) = 1 - P(NSP <= 2).
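As a check (a minimal sketch, assuming the fitted model and data splits are still in the workspace), the first row of the predictions above can be reproduced by hand:
#recompute the predicted probabilities for the first test row by hand
cf <- coef(train.m) #slope coefficients for LB, AC, FM
zeta <- train.m$zeta #cutpoints alpha1 (1|2) and alpha2 (2|3)
eta <- sum(cf * unlist(ctg.test[1, c("LB", "AC", "FM")]))
p1 <- plogis(zeta["1|2"] - eta) #P(NSP <= 1)
p2 <- plogis(zeta["2|3"] - eta) - p1 #P(NSP = 2)
p3 <- 1 - p1 - p2 #P(NSP = 3)
c(p1, p2, p3) #should match the first row of pred above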
#build the model with all variables except Max
#(Max is linearly dependent on Width and Min, since Width = Max - Min)
train.m.f <- polr(NSP ~ . -Max, ctg.train, Hess = TRUE)
summary(train.m.f)
## Call:
## polr(formula = NSP ~ . - Max, data = ctg.train, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## LB -6.122e-03 2.201e-02 -2.782e-01
## AC -7.993e+02 6.986e-03 -1.144e+05
## FM 3.321e+00 2.194e+00 1.514e+00
## UC -2.333e+02 1.733e-02 -1.347e+04
## DL 5.574e+01 5.580e-03 9.990e+03
## DS 8.043e+02 3.876e-05 2.075e+07
## DP 2.828e+03 3.322e-03 8.514e+05
## ASTV 9.479e-02 9.058e-03 1.047e+01
## MSTV 6.072e-02 2.056e-01 2.953e-01
## ALTV 3.509e-02 4.773e-03 7.352e+00
## MLTV 3.301e-02 2.470e-02 1.336e+00
## Width 3.974e-02 8.305e-03 4.785e+00
## Min 6.001e-02 1.142e-02 5.253e+00
## Nmax -5.724e-02 4.817e-02 -1.188e+00
## Nzeros 2.552e-03 1.254e-01 2.035e-02
## Mode -7.004e-02 1.845e-02 -3.795e+00
## Mean 7.095e-02 2.415e-02 2.938e+00
## Median -1.604e-02 3.076e-02 -5.213e-01
## Variance 3.093e-02 5.967e-03 5.183e+00
## Tendency0 6.704e-01 3.767e-01 1.780e+00
## Tendency1 1.125e+00 4.733e-01 2.377e+00
##
## Intercepts:
## Value Std. Error t value
## 1|2 12.5026 1.5401 8.1179
## 2|3 15.4923 1.5788 9.8129
##
## Residual Deviance: 971.8528
## AIC: 1017.853
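As a side check (illustrative, not in the original output), the full model can be compared with the three-variable model by AIC:
#compare the two fitted models by AIC (lower is better)
AIC(train.m, train.m.f)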
#calculate p-values to check which variables contribute significantly to the model
ctable <- coef(summary(train.m.f))
p <- pnorm(abs(ctable[,"t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))
## Value Std. Error t value p value
## LB -6.122484e-03 2.200661e-02 -2.782112e-01 7.808503e-01
## AC -7.992540e+02 6.986496e-03 -1.143998e+05 0.000000e+00
## FM 3.321253e+00 2.194298e+00 1.513583e+00 1.301315e-01
## UC -2.333494e+02 1.732816e-02 -1.346649e+04 0.000000e+00
## DL 5.573979e+01 5.579833e-03 9.989508e+03 0.000000e+00
## DS 8.042621e+02 3.876024e-05 2.074967e+07 0.000000e+00
## DP 2.828068e+03 3.321716e-03 8.513877e+05 0.000000e+00
## ASTV 9.479063e-02 9.057662e-03 1.046524e+01 1.247565e-25
## MSTV 6.072220e-02 2.056318e-01 2.952958e-01 7.677680e-01
## ALTV 3.508623e-02 4.772528e-03 7.351708e+00 1.956899e-13
## MLTV 3.300574e-02 2.470342e-02 1.336080e+00 1.815232e-01
## Width 3.973807e-02 8.304815e-03 4.784944e+00 1.710353e-06
## Min 6.000871e-02 1.142328e-02 5.253196e+00 1.494819e-07
## Nmax -5.723696e-02 4.816594e-02 -1.188328e+00 2.347041e-01
## Nzeros 2.552190e-03 1.253975e-01 2.035280e-02 9.837619e-01
## Mode -7.003754e-02 1.845473e-02 -3.795099e+00 1.475846e-04
## Mean 7.095310e-02 2.414832e-02 2.938221e+00 3.301013e-03
## Median -1.603545e-02 3.076289e-02 -5.212595e-01 6.021860e-01
## Variance 3.092812e-02 5.966865e-03 5.183311e+00 2.179813e-07
## Tendency0 6.704187e-01 3.767179e-01 1.779631e+00 7.513640e-02
## Tendency1 1.124826e+00 4.733099e-01 2.376511e+00 1.747723e-02
## 1|2 1.250265e+01 1.540130e+00 8.117917e+00 4.742523e-16
## 2|3 1.549231e+01 1.578766e+00 9.812920e+00 9.905981e-23
Let's say that in this analysis we want to keep every variable significant at the 10% level, i.e. any variable with a p-value of 0.10 or below.
Looking at the p-values above, LB has a p-value of about 7.8e-01, i.e. 0.78, far above the 0.10 threshold, so LB is not contributing and can be dropped from further analysis.
Likewise, FM, MSTV, MLTV, Nmax, Nzeros, and Median fall short of the 10% threshold and are dropped; Tendency is kept because the Tendency1 dummy is clearly significant (Tendency0 also passes the 10% threshold).
#build model with significant variables
train.m.f1 <- polr(NSP ~ . - Max - LB - FM - MSTV - MLTV - Nmax - Nzeros - Median, ctg.train)
This becomes our final model. Now all the variables in the model are statistically significant, and these coefficients are used for further predictions.
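To verify that claim, the p-value computation from before can be repeated on the final model (a sketch; summary() will re-fit once to obtain the Hessian because Hess = TRUE was not set above):
#recompute p-values for the final model
ct1 <- coef(summary(train.m.f1))
cbind(ct1, "p value" = pnorm(abs(ct1[, "t value"]), lower.tail = FALSE) * 2)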
#Prediction
#create confusion matrix for train dataset
pred <- predict(train.m.f1, ctg.train)
(tab.tr <- table(pred,ctg.train$NSP))
##
## pred 1 2 3
## 1 1246 90 7
## 2 58 111 41
## 3 1 27 92
#misclassification on train dataset
1-sum(diag(tab.tr))/sum(tab.tr)
## [1] 0.1338912
The misclassification error on the training dataset is about 13%.
#confusion matrix on test data
pred1 <- predict(train.m.f1, ctg.test)
(tab.ts <- table(pred1, ctg.test$NSP))
##
## pred1 1 2 3
## 1 337 28 1
## 2 13 30 11
## 3 0 9 24
#misclassification error on test data
1-sum(diag(tab.ts))/sum(tab.ts)
## [1] 0.1368653
There is about a 14% misclassification error on the test data, close to the training error, so the model does not appear to be overfitting.
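Beyond the overall error, per-class recall can be read off the same confusion matrix (an illustrative sketch; the columns hold the true classes):
#per-class recall on the test set: correct predictions over column totals
diag(tab.ts) / colSums(tab.ts)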