library(readxl)
## Warning: package 'readxl' was built under R version 3.5.3
ctg <- read_xlsx("F:/Machine Learning/Data Science/Machine Learning/Ordinal/CTG.xlsx")
head(ctg)
## # A tibble: 6 x 22
## LB AC FM UC DL DS DP ASTV MSTV ALTV MLTV
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 120 0 0 0 0 0 0 73 0.5 43 2.4
## 2 132 0.00638 0 0.00638 0.00319 0 0 17 2.1 0 10.4
## 3 133 0.00332 0 0.00831 0.00332 0 0 16 2.1 0 13.4
## 4 134 0.00256 0 0.00768 0.00256 0 0 16 2.4 0 23
## 5 132 0.00651 0 0.00814 0 0 0 16 2.4 0 19.9
## 6 134 0.00105 0 0.0105 0.00944 0 0.00210 26 5.9 0 0
## # ... with 11 more variables: Width <dbl>, Min <dbl>, Max <dbl>,
## # Nmax <dbl>, Nzeros <dbl>, Mode <dbl>, Mean <dbl>, Median <dbl>,
## # Variance <dbl>, Tendency <dbl>, NSP <dbl>
str(ctg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2126 obs. of 22 variables:
## $ LB : num 120 132 133 134 132 134 134 122 122 122 ...
## $ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
## $ FM : num 0 0 0 0 0 0 0 0 0 0 ...
## $ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
## $ DL : num 0 0.00319 0.00332 0.00256 0 ...
## $ DS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DP : num 0 0 0 0 0 ...
## $ ASTV : num 73 17 16 16 16 26 29 83 84 86 ...
## $ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
## $ ALTV : num 43 0 0 0 0 0 0 6 5 6 ...
## $ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
## $ Width : num 64 130 130 117 117 150 150 68 68 68 ...
## $ Min : num 62 68 68 53 53 50 50 62 62 62 ...
## $ Max : num 126 198 198 170 170 200 200 130 130 130 ...
## $ Nmax : num 2 6 5 11 9 5 6 0 0 1 ...
## $ Nzeros : num 0 1 1 0 0 3 3 0 0 0 ...
## $ Mode : num 120 141 141 137 137 76 71 122 122 122 ...
## $ Mean : num 137 136 135 134 136 107 107 122 122 122 ...
## $ Median : num 121 140 138 137 138 107 106 123 123 123 ...
## $ Variance: num 73 12 13 13 11 170 215 3 3 1 ...
## $ Tendency: num 1 0 0 1 1 0 0 1 1 1 ...
## $ NSP : num 2 1 1 1 1 3 3 3 3 3 ...
The target variable NSP has three classes: 1 means normal, 2 means suspect, and 3 means pathologic.
#convert NSP into an ordinal variable
ctg$NSP <- as.ordered(ctg$NSP)
str(ctg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2126 obs. of 22 variables:
## $ LB : num 120 132 133 134 132 134 134 122 122 122 ...
## $ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
## $ FM : num 0 0 0 0 0 0 0 0 0 0 ...
## $ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
## $ DL : num 0 0.00319 0.00332 0.00256 0 ...
## $ DS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DP : num 0 0 0 0 0 ...
## $ ASTV : num 73 17 16 16 16 26 29 83 84 86 ...
## $ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
## $ ALTV : num 43 0 0 0 0 0 0 6 5 6 ...
## $ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
## $ Width : num 64 130 130 117 117 150 150 68 68 68 ...
## $ Min : num 62 68 68 53 53 50 50 62 62 62 ...
## $ Max : num 126 198 198 170 170 200 200 130 130 130 ...
## $ Nmax : num 2 6 5 11 9 5 6 0 0 1 ...
## $ Nzeros : num 0 1 1 0 0 3 3 0 0 0 ...
## $ Mode : num 120 141 141 137 137 76 71 122 122 122 ...
## $ Mean : num 137 136 135 134 136 107 107 122 122 122 ...
## $ Median : num 121 140 138 137 138 107 106 123 123 123 ...
## $ Variance: num 73 12 13 13 11 170 215 3 3 1 ...
## $ Tendency: num 1 0 0 1 1 0 0 1 1 1 ...
## $ NSP : Ord.factor w/ 3 levels "1"<"2"<"3": 2 1 1 1 1 3 3 3 3 3 ...
The NSP variable is now an ordered factor with 1 < 2 < 3, where 3 (pathologic) is the worst condition.
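As a quick aside (illustrative, not part of the original analysis), ordered factors support inequality comparisons between levels, which is exactly what the ordinal model relies on:
#ordered factor levels can be compared
f <- factor(c(1, 3, 2), levels = 1:3, ordered = TRUE)
f[1] < f[2] #TRUE, because 1 < 3 in the declared ordering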
summary(ctg)
## LB AC FM UC
## Min. :106.0 Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:126.0 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.001876
## Median :133.0 Median :0.001630 Median :0.000000 Median :0.004482
## Mean :133.3 Mean :0.003170 Mean :0.009474 Mean :0.004357
## 3rd Qu.:140.0 3rd Qu.:0.005631 3rd Qu.:0.002512 3rd Qu.:0.006525
## Max. :160.0 Max. :0.019284 Max. :0.480634 Max. :0.014925
## DL DS DP
## Min. :0.000000 Min. :0.000e+00 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.000e+00 1st Qu.:0.0000000
## Median :0.000000 Median :0.000e+00 Median :0.0000000
## Mean :0.001885 Mean :3.585e-06 Mean :0.0001566
## 3rd Qu.:0.003264 3rd Qu.:0.000e+00 3rd Qu.:0.0000000
## Max. :0.015385 Max. :1.353e-03 Max. :0.0053476
## ASTV MSTV ALTV MLTV
## Min. :12.00 Min. :0.200 Min. : 0.000 Min. : 0.000
## 1st Qu.:32.00 1st Qu.:0.700 1st Qu.: 0.000 1st Qu.: 4.600
## Median :49.00 Median :1.200 Median : 0.000 Median : 7.400
## Mean :46.99 Mean :1.333 Mean : 9.847 Mean : 8.188
## 3rd Qu.:61.00 3rd Qu.:1.700 3rd Qu.:11.000 3rd Qu.:10.800
## Max. :87.00 Max. :7.000 Max. :91.000 Max. :50.700
## Width Min Max Nmax
## Min. : 3.00 Min. : 50.00 Min. :122 Min. : 0.000
## 1st Qu.: 37.00 1st Qu.: 67.00 1st Qu.:152 1st Qu.: 2.000
## Median : 67.50 Median : 93.00 Median :162 Median : 3.000
## Mean : 70.45 Mean : 93.58 Mean :164 Mean : 4.068
## 3rd Qu.:100.00 3rd Qu.:120.00 3rd Qu.:174 3rd Qu.: 6.000
## Max. :180.00 Max. :159.00 Max. :238 Max. :18.000
## Nzeros Mode Mean Median
## Min. : 0.0000 Min. : 60.0 Min. : 73.0 Min. : 77.0
## 1st Qu.: 0.0000 1st Qu.:129.0 1st Qu.:125.0 1st Qu.:129.0
## Median : 0.0000 Median :139.0 Median :136.0 Median :139.0
## Mean : 0.3236 Mean :137.5 Mean :134.6 Mean :138.1
## 3rd Qu.: 0.0000 3rd Qu.:148.0 3rd Qu.:145.0 3rd Qu.:148.0
## Max. :10.0000 Max. :187.0 Max. :182.0 Max. :186.0
## Variance Tendency NSP
## Min. : 0.00 Min. :-1.0000 1:1655
## 1st Qu.: 2.00 1st Qu.: 0.0000 2: 295
## Median : 7.00 Median : 0.0000 3: 176
## Mean : 18.81 Mean : 0.3203
## 3rd Qu.: 24.00 3rd Qu.: 1.0000
## Max. :269.00 Max. : 1.0000
Out of the 21 predictor variables, Tendency takes only three values: -1, 0, and 1.
So we also convert Tendency into a categorical variable.
#convert Tendency into a categorical variable
ctg$Tendency <- as.factor(ctg$Tendency)
Now we cross-tabulate NSP with Tendency.
#cross-tabulation
xtabs(~NSP + Tendency, ctg)
## Tendency
## NSP -1 0 1
## 1 101 887 667
## 2 15 137 143
## 3 49 91 36
The cross-tabulation shows that every cell has a sufficient frequency for the analysis.
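To see how NSP distributes within each Tendency level, column-wise proportions can be computed as well (a small sketch, not in the original output):
#proportion of each NSP class within each Tendency level
prop.table(xtabs(~NSP + Tendency, ctg), margin = 2)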
We now partition the data into two parts: roughly 80% for training and 20% for testing.
#Data partition
#note: fixing the RNG seed first (e.g. set.seed(123)) would make this split reproducible
ind <- sample(2, nrow(ctg), replace = TRUE, prob = c(.8,.2))
ctg.train <- ctg[ind==1, ]
ctg.test <- ctg[ind==2, ]
dim(ctg.train)
## [1] 1673 22
dim(ctg.test)
## [1] 453 22
library(MASS)
#polr stands for proportional odds logistic regression
#building with only three variables just to illustrate
#Hess = TRUE stores the Hessian so that summary() can compute standard errors
train.m <- polr(NSP ~ LB+AC+FM, ctg.train, Hess = TRUE)
summary(train.m)
## Call:
## polr(formula = NSP ~ LB + AC + FM, data = ctg.train, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## LB 0.03903 0.006717 5.811e+00
## AC -821.35554 0.004881 -1.683e+05
## FM 7.03879 1.242086 5.667e+00
##
## Intercepts:
## Value Std. Error t value
## 1|2 5.6165 0.9217 6.0933
## 2|3 6.9965 0.9325 7.5027
##
## Residual Deviance: 1740.303
## AIC: 1750.303
Interpretation of the coefficient for FM: for a 1-unit increase in FM, we expect about a 7.039 increase in the expected value of NSP on the log-odds scale, holding all other variables in the model constant.
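Equivalently (an illustrative addition, not in the original output), exponentiating the coefficient expresses the effect on the odds scale:
#odds-ratio view: a 1-unit increase in FM multiplies the odds of being
#in a higher NSP category by exp(b_FM)
exp(coef(train.m)["FM"])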
#calculate probabilities
p <- predict(train.m, ctg.train, type = "probs")
head(p)
## 1 2 3
## 1 0.9966799 0.002482844 0.0008373026
## 2 0.9590754 0.030304224 0.0106203572
## 3 0.9234234 0.056141683 0.0204349353
## 4 0.9970275 0.002223091 0.0007494434
## 5 0.7770082 0.155657845 0.0673339542
## 6 0.8232355 0.125517449 0.0512470313
The summary also shows standard errors and t-values; the only thing still missing is the p-values.
#calculating p-values
#coef applied to the model summary returns the coefficient table
(ctable <- coef(summary(train.m)))
## Value Std. Error t value
## LB 0.03903007 0.006716933 5.810698e+00
## AC -821.35553742 0.004881380 -1.682630e+05
## FM 7.03879328 1.242086196 5.666912e+00
## 1|2 5.61648231 0.921747160 6.093300e+00
## 2|3 6.99654897 0.932539248 7.502686e+00
#compute two-sided p-values from the t values using the normal approximation
p <- pnorm(abs(ctable[ , "t value"]), lower.tail = FALSE) * 2
#ctable with p-values
(ctable <- cbind(ctable, "p value"=p))
## Value Std. Error t value p value
## LB 0.03903007 0.006716933 5.810698e+00 6.221296e-09
## AC -821.35553742 0.004881380 -1.682630e+05 0.000000e+00
## FM 7.03879328 1.242086196 5.666912e+00 1.453939e-08
## 1|2 5.61648231 0.921747160 6.093300e+00 1.106062e-09
## 2|3 6.99654897 0.932539248 7.502686e+00 6.252330e-14
We may notice that all three variables - LB, AC, and FM - are statistically significant because the p-values are very small.
For LB, the p-value is about 6.2e-09, which is statistically significant.
For AC, the p-value is essentially zero (the extremely large coefficient and t value for AC reflect the very small scale of that variable; rescaling such predictors is often advisable).
For FM, the p-value is about 1.5e-08, which is statistically significant.
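If confidence intervals are wanted too, approximate 95% Wald intervals can be built from the same table (a sketch; MASS also provides profile-likelihood intervals via confint):
#approximate 95% Wald confidence intervals from the coefficient table
ci <- ctable[, "Value"] + outer(ctable[, "Std. Error"], qnorm(c(0.025, 0.975)))
colnames(ci) <- c("2.5 %", "97.5 %")
ci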
#make predictions on the first five test rows using predict
pred <- predict(train.m, ctg.test[1:5, ], type = 'probs')
print(pred, digits = 5)
## 1 2 3
## 1 0.71766 0.192285 0.090057
## 2 0.70158 0.201762 0.096662
## 3 0.80081 0.140301 0.058887
## 4 0.44255 0.316823 0.240625
## 5 0.88779 0.081396 0.030817
These predictions are computed directly from the fitted coefficients and intercepts, as shown next.
#Equations for calculating the probabilities
ctable
## Value Std. Error t value p value
## LB 0.03903007 0.006716933 5.810698e+00 6.221296e-09
## AC -821.35553742 0.004881380 -1.682630e+05 0.000000e+00
## FM 7.03879328 1.242086196 5.666912e+00 1.453939e-08
## 1|2 5.61648231 0.921747160 6.093300e+00 1.106062e-09
## 2|3 6.99654897 0.932539248 7.502686e+00 6.252330e-14
Reading the fitted values off the table: b1 = 0.03903 (LB), b2 = -821.35554 (AC), b3 = 7.03879 (FM), alpha1 = 5.61648 (intercept 1|2), and alpha2 = 6.99655 (intercept 2|3).
The proportional odds model first forms the linear predictor
eta = b1*LB + b2*AC + b3*FM
and then the cumulative probabilities
logit[P(NSP <= j)] = alpha_j - eta, for j = 1, 2
so that P(NSP = 1) = P(NSP <= 1), P(NSP = 2) = P(NSP <= 2) - P(NSP <= 1), and P(NSP = 3) = 1 - P(NSP <= 2).
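As a check (a minimal sketch, assuming the fitted model and data splits are still in the workspace), the first row of the predictions above can be reproduced by hand:
#recompute the predicted probabilities for the first test row by hand
cf <- coef(train.m) #slope coefficients for LB, AC, FM
zeta <- train.m$zeta #cutpoints alpha1 (1|2) and alpha2 (2|3)
eta <- sum(cf * unlist(ctg.test[1, c("LB", "AC", "FM")]))
p1 <- plogis(zeta["1|2"] - eta) #P(NSP <= 1)
p2 <- plogis(zeta["2|3"] - eta) - p1 #P(NSP = 2)
p3 <- 1 - p1 - p2 #P(NSP = 3)
c(p1, p2, p3) #should match the first row of pred above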
#build the model with all variables except Max
#(Max is linearly dependent on Width and Min, since Width = Max - Min)
train.m.f <- polr(NSP ~ . -Max, ctg.train, Hess = TRUE)
summary(train.m.f)
## Call:
## polr(formula = NSP ~ . - Max, data = ctg.train, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## LB -6.122e-03 2.201e-02 -2.782e-01
## AC -7.993e+02 6.986e-03 -1.144e+05
## FM 3.321e+00 2.194e+00 1.514e+00
## UC -2.333e+02 1.733e-02 -1.347e+04
## DL 5.574e+01 5.580e-03 9.990e+03
## DS 8.043e+02 3.876e-05 2.075e+07
## DP 2.828e+03 3.322e-03 8.514e+05
## ASTV 9.479e-02 9.058e-03 1.047e+01
## MSTV 6.072e-02 2.056e-01 2.953e-01
## ALTV 3.509e-02 4.773e-03 7.352e+00
## MLTV 3.301e-02 2.470e-02 1.336e+00
## Width 3.974e-02 8.305e-03 4.785e+00
## Min 6.001e-02 1.142e-02 5.253e+00
## Nmax -5.724e-02 4.817e-02 -1.188e+00
## Nzeros 2.552e-03 1.254e-01 2.035e-02
## Mode -7.004e-02 1.845e-02 -3.795e+00
## Mean 7.095e-02 2.415e-02 2.938e+00
## Median -1.604e-02 3.076e-02 -5.213e-01
## Variance 3.093e-02 5.967e-03 5.183e+00
## Tendency0 6.704e-01 3.767e-01 1.780e+00
## Tendency1 1.125e+00 4.733e-01 2.377e+00
##
## Intercepts:
## Value Std. Error t value
## 1|2 12.5026 1.5401 8.1179
## 2|3 15.4923 1.5788 9.8129
##
## Residual Deviance: 971.8528
## AIC: 1017.853
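As a side check (illustrative, not in the original output), the full model can be compared with the three-variable model by AIC:
#compare the two fitted models by AIC (lower is better)
AIC(train.m, train.m.f)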
#calculate p-values to check which variables contribute significantly to the model
ctable <- coef(summary(train.m.f))
p <- pnorm(abs(ctable[,"t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))
## Value Std. Error t value p value
## LB -6.122484e-03 2.200661e-02 -2.782112e-01 7.808503e-01
## AC -7.992540e+02 6.986496e-03 -1.143998e+05 0.000000e+00
## FM 3.321253e+00 2.194298e+00 1.513583e+00 1.301315e-01
## UC -2.333494e+02 1.732816e-02 -1.346649e+04 0.000000e+00
## DL 5.573979e+01 5.579833e-03 9.989508e+03 0.000000e+00
## DS 8.042621e+02 3.876024e-05 2.074967e+07 0.000000e+00
## DP 2.828068e+03 3.321716e-03 8.513877e+05 0.000000e+00
## ASTV 9.479063e-02 9.057662e-03 1.046524e+01 1.247565e-25
## MSTV 6.072220e-02 2.056318e-01 2.952958e-01 7.677680e-01
## ALTV 3.508623e-02 4.772528e-03 7.351708e+00 1.956899e-13
## MLTV 3.300574e-02 2.470342e-02 1.336080e+00 1.815232e-01
## Width 3.973807e-02 8.304815e-03 4.784944e+00 1.710353e-06
## Min 6.000871e-02 1.142328e-02 5.253196e+00 1.494819e-07
## Nmax -5.723696e-02 4.816594e-02 -1.188328e+00 2.347041e-01
## Nzeros 2.552190e-03 1.253975e-01 2.035280e-02 9.837619e-01
## Mode -7.003754e-02 1.845473e-02 -3.795099e+00 1.475846e-04
## Mean 7.095310e-02 2.414832e-02 2.938221e+00 3.301013e-03
## Median -1.603545e-02 3.076289e-02 -5.212595e-01 6.021860e-01
## Variance 3.092812e-02 5.966865e-03 5.183311e+00 2.179813e-07
## Tendency0 6.704187e-01 3.767179e-01 1.779631e+00 7.513640e-02
## Tendency1 1.124826e+00 4.733099e-01 2.376511e+00 1.747723e-02
## 1|2 1.250265e+01 1.540130e+00 8.117917e+00 4.742523e-16
## 2|3 1.549231e+01 1.578766e+00 9.812920e+00 9.905981e-23
Let's say that in this analysis we want to keep every variable significant at the 10% level, i.e. any variable with a p-value of 0.10 or below.
Looking at the p-values above, LB has a p-value of about 7.8e-01, i.e. 0.78, far above the 0.10 threshold, so LB is not contributing and can be dropped from further analysis.
Likewise, FM, MSTV, MLTV, Nmax, Nzeros, and Median fall short of the 10% threshold and are dropped; Tendency is kept because the Tendency1 dummy is clearly significant (Tendency0 also passes the 10% threshold).
#build model with significant variables
train.m.f1 <- polr(NSP ~ . - Max - LB - FM - MSTV - MLTV - Nmax - Nzeros - Median, ctg.train)
This becomes our final model. Now all the variables in the model are statistically significant, and these coefficients are used for further predictions.
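To verify that claim, the p-value computation from before can be repeated on the final model (a sketch; summary() will re-fit once to obtain the Hessian because Hess = TRUE was not set above):
#recompute p-values for the final model
ct1 <- coef(summary(train.m.f1))
cbind(ct1, "p value" = pnorm(abs(ct1[, "t value"]), lower.tail = FALSE) * 2)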
#Prediction
#create confusion matrix for train dataset
pred <- predict(train.m.f1, ctg.train)
(tab.tr <- table(pred,ctg.train$NSP))
##
## pred 1 2 3
## 1 1246 90 7
## 2 58 111 41
## 3 1 27 92
#misclassification on train dataset
1-sum(diag(tab.tr))/sum(tab.tr)
## [1] 0.1338912
The misclassification error on the training dataset is about 13%.
#confusion matrix on test data
pred1 <- predict(train.m.f1, ctg.test)
(tab.ts <- table(pred1, ctg.test$NSP))
##
## pred1 1 2 3
## 1 337 28 1
## 2 13 30 11
## 3 0 9 24
#misclassification error on test data
1-sum(diag(tab.ts))/sum(tab.ts)
## [1] 0.1368653
There is about a 14% misclassification error on the test data, close to the training error, so the model does not appear to be overfitting.
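Beyond the overall error, per-class recall can be read off the same confusion matrix (an illustrative sketch; the columns hold the true classes):
#per-class recall on the test set: correct predictions over column totals
diag(tab.ts) / colSums(tab.ts)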