DIVACTORY 01 Report By Djoko Soehartono and M. Farhan Rashid
In this DIVACTORY 01 Warm Up case study, we are given two CSV files, Training_Data.csv and Test_Data.csv, containing the training and test data respectively. The objectives for this case study are:
1. Develop an exploratory data analysis based on the given training and test datasets.
2. Develop a model that minimizes the negative log-loss function.
3. Document the model results in a report and slides.
No further information was shared by the competition host regarding the training and test datasets. We will try to learn more about them in our exploratory data analysis.
The first step is to read the training and test datasets into the Training_Data and Test_Data data frames as below.
Training_Data <- read.csv("Training_Data.csv", header = TRUE, sep = ",")
Test_Data <- read.csv("Test_Data.csv", header = TRUE, sep = ",")
Let’s see how the datasets look with the R str() function. We see that Training_Data consists of 22 columns: 21 input variables and one output variable, target. Test_Data consists of only the 21 input columns, without the target output column. Training_Data has 21681 rows and Test_Data has 86724 rows. Since the target column has only two output classes, 0 and 1, we will define it as a factor with the as.factor() function.
str(Training_Data)
## 'data.frame': 21681 obs. of 22 variables:
## $ peubah1 : num 0.416 0.393 0.412 0.416 0.431 ...
## $ peubah2 : num 0.434 0.461 0.464 0.434 0.461 ...
## $ peubah3 : num 0.48 0.453 0.418 0.483 0.419 ...
## $ peubah4 : num 0.391 0.405 0.455 0.358 0.45 ...
## $ peubah5 : num 0.312 0.377 0.415 0.308 0.409 ...
## $ peubah6 : num 0.35 0.393 0.418 0.342 0.45 ...
## $ peubah7 : num 0.48 0.407 0.336 0.501 0.371 ...
## $ peubah8 : num 0.367 0.409 0.397 0.376 0.372 ...
## $ peubah9 : num 0.297 0.256 0.359 0.271 0.324 ...
## $ peubah10: num 0.408 0.491 0.497 0.422 0.458 ...
## $ peubah11: num 0.328 0.368 0.404 0.319 0.404 ...
## $ peubah12: num 0.377 0.429 0.452 0.367 0.438 ...
## $ peubah13: num 0.346 0.358 0.396 0.339 0.404 ...
## $ peubah14: num 0.519 0.479 0.413 0.538 0.423 ...
## $ peubah15: num 0.462 0.447 0.368 0.476 0.394 ...
## $ peubah16: num 0.352 0.335 0.389 0.342 0.388 ...
## $ peubah17: num 0.455 0.446 0.415 0.458 0.408 ...
## $ peubah18: num 0.481 0.452 0.413 0.491 0.412 ...
## $ peubah19: num 0.442 0.422 0.359 0.455 0.391 ...
## $ peubah20: num 0.382 0.376 0.373 0.392 0.367 ...
## $ peubah21: num 0.361 0.305 0.299 0.37 0.322 ...
## $ target : int 1 0 0 1 1 1 0 1 1 0 ...
str(Test_Data)
## 'data.frame': 86724 obs. of 21 variables:
## $ peubah1 : num 0.401 0.471 0.441 0.426 0.45 ...
## $ peubah2 : num 0.444 0.442 0.324 0.551 0.343 ...
## $ peubah3 : num 0.465 0.405 0.484 0.388 0.416 ...
## $ peubah4 : num 0.376 0.428 0.309 0.49 0.387 ...
## $ peubah5 : num 0.348 0.408 0.315 0.428 0.421 ...
## $ peubah6 : num 0.352 0.422 0.208 0.439 0.381 ...
## $ peubah7 : num 0.439 0.408 0.468 0.393 0.298 ...
## $ peubah8 : num 0.408 0.337 0.373 0.373 0.399 ...
## $ peubah9 : num 0.285 0.355 0.422 0.218 0.493 ...
## $ peubah10: num 0.468 0.422 0.372 0.538 0.409 ...
## $ peubah11: num 0.342 0.408 0.309 0.438 0.383 ...
## $ peubah12: num 0.396 0.425 0.333 0.508 0.373 ...
## $ peubah13: num 0.348 0.425 0.344 0.418 0.409 ...
## $ peubah14: num 0.497 0.427 0.491 0.457 0.336 ...
## $ peubah15: num 0.432 0.391 0.383 0.471 0.2 ...
## $ peubah16: num 0.349 0.41 0.352 0.393 0.423 ...
## $ peubah17: num 0.451 0.387 0.46 0.385 0.409 ...
## $ peubah18: num 0.472 0.405 0.496 0.383 0.423 ...
## $ peubah19: num 0.406 0.413 0.43 0.418 0.274 ...
## $ peubah20: num 0.398 0.371 0.452 0.289 0.471 ...
## $ peubah21: num 0.343 0.363 0.46 0.193 0.453 ...
Training_Data$target <- as.factor(Training_Data$target)
Let’s see the range of values for all input and output variables of Training_Data as below:
summary(Training_Data)
## peubah1 peubah2 peubah3 peubah4
## Min. :0.0000 Min. :0.0265 Min. :0.08678 Min. :0.0000
## 1st Qu.:0.3758 1st Qu.:0.3918 1st Qu.:0.43228 1st Qu.:0.3426
## Median :0.4052 Median :0.4346 Median :0.46691 Median :0.3827
## Mean :0.4041 Mean :0.4285 Mean :0.46388 Mean :0.3785
## 3rd Qu.:0.4363 3rd Qu.:0.4720 3rd Qu.:0.49945 3rd Qu.:0.4191
## Max. :0.6808 Max. :0.6661 Max. :0.68957 Max. :0.6288
## peubah5 peubah6 peubah7 peubah8
## Min. :0.01573 Min. :0.06821 Min. :0.04294 Min. :0.06272
## 1st Qu.:0.31949 1st Qu.:0.36199 1st Qu.:0.39151 1st Qu.:0.36220
## Median :0.35690 Median :0.40701 Median :0.43157 Median :0.39411
## Mean :0.35554 Mean :0.40325 Mean :0.42613 Mean :0.39331
## 3rd Qu.:0.39357 3rd Qu.:0.44870 3rd Qu.:0.46296 3rd Qu.:0.42481
## Max. :0.59906 Max. :0.69315 Max. :0.66133 Max. :0.69315
## peubah9 peubah10 peubah11
## Min. :0.005346 Min. :0.0000 Min. :0.001948
## 1st Qu.:0.242413 1st Qu.:0.4130 1st Qu.:0.308727
## Median :0.283388 Median :0.4519 Median :0.347334
## Mean :0.288719 Mean :0.4461 Mean :0.345317
## 3rd Qu.:0.330045 3rd Qu.:0.4859 3rd Qu.:0.384664
## Max. :0.679363 Max. :0.6655 Max. :0.630127
## peubah12 peubah13 peubah14 peubah15
## Min. :0.0007097 Min. :0.007948 Min. :0.1275 Min. :0.0646
## 1st Qu.:0.3475252 1st Qu.:0.321467 1st Qu.:0.4516 1st Qu.:0.3957
## Median :0.3905207 Median :0.354438 Median :0.4861 Median :0.4352
## Mean :0.3862154 Mean :0.353111 Mean :0.4830 Mean :0.4297
## 3rd Qu.:0.4291426 3rd Qu.:0.387220 3rd Qu.:0.5171 3rd Qu.:0.4688
## Max. :0.6603450 Max. :0.651168 Max. :0.6929 Max. :0.6636
## peubah16 peubah17 peubah18 peubah19
## Min. :0.01702 Min. :0.06889 Min. :0.1159 Min. :0.001149
## 1st Qu.:0.30723 1st Qu.:0.42082 1st Qu.:0.4327 1st Qu.:0.387125
## Median :0.34110 Median :0.45101 Median :0.4671 Median :0.425967
## Mean :0.34021 Mean :0.44853 Mean :0.4646 Mean :0.420817
## 3rd Qu.:0.37417 3rd Qu.:0.47993 3rd Qu.:0.4996 3rd Qu.:0.456107
## Max. :0.62958 Max. :0.69082 Max. :0.6912 Max. :0.659270
## peubah20 peubah21 target
## Min. :0.06297 Min. :0.03861 0:10756
## 1st Qu.:0.36750 1st Qu.:0.31216 1:10925
## Median :0.39835 Median :0.35353
## Mean :0.39924 Mean :0.35739
## 3rd Qu.:0.43211 3rd Qu.:0.40094
## Max. :0.65263 Max. :0.69315
At a glance, all input variables in Training_Data appear to have similar ranges, with minimum values near 0 and maximum values between 0.6 and 0.7. Almost all the medians fall between 0.3 and 0.5.
Now let’s look at the scatterplot matrix of the variables in Training_Data. To be complete, it should show 22 by 22 scatterplot panels, but that would be difficult to read in this report, so we divided the scatterplot matrix into the three plots below. Although they are not exhaustive, we can still get a good sense of how Training_Data is distributed across its variable pairs.
We see that the data is scattered differently between the variable pairs. The green and blue dots in each panel represent data points for output classes 0 and 1 respectively. Looking at each panel, there is no variable that can significantly “separate” the green and blue dots.
library(graphics)
pairs(~ peubah1+peubah2+peubah3+peubah4+peubah5+peubah6+peubah7+target,
data = Training_Data, main = "peubah1 to peubah7 + target",
col = 3 + (Training_Data$target == 1), upper.panel = NULL)
pairs(~ peubah8+peubah9+peubah10+peubah11+peubah12+peubah13+peubah14+target,
data = Training_Data, main = "peubah8 to peubah14 + target",
col = 3 + (Training_Data$target == 1), upper.panel = NULL)
pairs(~ peubah15+peubah16+peubah17+peubah18+peubah19+peubah20+peubah21+target,
data = Training_Data, main = "peubah15 to peubah21 + target",
col = 3 + (Training_Data$target == 1), upper.panel = NULL)
Furthermore, let’s look at Training_Data from another angle: the cross-correlations between its input variables. The correlation plot is shown below.
Apart from the diagonal, which shows each variable’s self-correlation (a perfect 1.0), we also see high cross-correlations as dark blue and dark red squares, i.e. values near +1.0 or -1.0 respectively. This implies a good number of collinearities between the input variables, which suggests that dimension reduction, for example with principal component analysis (PCA), could be applied to the data. We will return to this during our prediction modelling.
library(corrplot)
corrplot(cor(Training_Data[,-22]), type = "lower", method = "square")
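To quantify this, we can also list the variable pairs whose absolute correlation exceeds some threshold; a minimal sketch is shown below (the 0.8 cut-off is only an illustrative choice):
# Sketch: list variable pairs with absolute correlation above 0.8.
cor_mat <- cor(Training_Data[, -22])
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA          # keep each pair only once
high_cor <- which(abs(cor_mat) > 0.8, arr.ind = TRUE)   # indices of strongly correlated pairs
data.frame(var1 = rownames(cor_mat)[high_cor[, 1]],
           var2 = colnames(cor_mat)[high_cor[, 2]],
           cor  = round(cor_mat[high_cor], 2))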
Given our exploratory data analysis above, it is also interesting to see how the data is distributed for each variable. These distributions are shown in the two plots below.
Visually, the distribution shapes of the input variables look almost identical.
library(purrr)
library(tidyr)
library(ggplot2)
Training_Data[, -22] %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free", nrow = 3, ncol = 7) +
  geom_histogram(aes(y = ..density..), alpha = .6, fill = "navy", binwidth = 0.01)
g <- ggplot(data = Training_Data, aes(x = target))
g + geom_bar(fill = "navy", alpha = .6) + ggtitle("Count of '0' and '1' in target")
Logarithmic Loss, or simply Log Loss, is a classification loss function often used as an evaluation metric for classifiers. Before we start building a model, it makes sense to understand how this metric is calculated and how it should be interpreted.
Log Loss quantifies the accuracy of a classifier by penalising false classifications: confident predictions for the wrong class are penalised heavily. Minimising the Log Loss is broadly equivalent to maximising the accuracy of the classifier.
To calculate Log Loss, the classifier must assign a probability to each class rather than simply yielding the most likely class. The multi-class version of the Logarithmic Loss metric is shown below. Each observation belongs to one class, and for each observation we submit a predicted probability for each class. The metric is the negative log-likelihood of a model under which each observation is drawn independently from a distribution that places the submitted probability mass on its corresponding class.
\[logloss = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^M y_{i,j} \log (p_{i,j})\] where \(N\) is the number of observations, \(M\) is the number of class labels, \(log\) is the natural logarithm, \(y_{i,j}\) is 1 if observation \(i\) is in class \(j\) and 0 otherwise, and \(p_{i,j}\) is the predicted probability that observation \(i\) is in class \(j\).
If there are only two classes then the expression above simplifies to
\[logloss = - \frac{1}{N} \sum_{i=1}^{N} [y_i \log p_i + (1-y_i) \log (1 - p_i)]\] Note that for each instance only the term for the correct class actually contributes to the sum.
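As a concrete illustration, here is a minimal R sketch of the binary log-loss above; it should closely match the LogLoss function from MLmetrics that we use later, which also clips probabilities away from 0 and 1:
# A minimal sketch of the binary log-loss formula above.
# y_true: vector of 0/1 labels; y_pred: predicted probabilities P(y = 1).
# Probabilities are clipped away from 0 and 1 to avoid log(0).
binary_log_loss <- function(y_true, y_pred, eps = 1e-15) {
  y_pred <- pmin(pmax(y_pred, eps), 1 - eps)
  -mean(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
}
# Predicting 0.5 for every observation gives -log(0.5), about 0.693.
binary_log_loss(c(1, 0, 1, 0), rep(0.5, 4))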
The first model we would like to assess is logistic regression, a log-linear model, since it is closely related to the log-loss function.
A log-linear model assumes that the log-odds of the conditional probability of the target given the features is a weighted linear combination of the features. These weights are the parameters of the model that we would like to estimate.
If we assume independence of samples, our objective is to maximize the product of the prediction probabilities of the data, or equivalently to minimize the sum of the negative logarithms of the prediction probabilities. The mathematical expression for the negative log prediction probability of the log-linear model over a given set of samples is exactly the log-loss.
You may wonder why log-linear models are used in the first place. It turns out that log-linear models are the least biased models that preserve the empirical feature probabilities. Intuitively, a reasonable model for an unknown binary probability distribution given some data is the maximum entropy distribution whose feature expectations agree with the empirical feature expectations in the data. One can show this leads to the exponential family of distributions, of which log-linear models are the simplest instance.
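To make the log-odds relationship concrete, here is a small sketch with purely hypothetical weights (not our fitted model) showing how a log-linear model turns a linear score into a class probability:
# Sketch of the log-linear (logistic) link, using hypothetical weights w and inputs x.
# The model assumes log(p / (1 - p)) = w0 + w1*x1 + w2*x2; inverting gives the sigmoid.
sigmoid <- function(z) 1 / (1 + exp(-z))

w <- c(-0.5, 1.2, -0.8)   # hypothetical intercept and two feature weights
x <- c(1, 0.4, 0.3)       # 1 for the intercept, then two feature values
z <- sum(w * x)           # linear score, i.e. the log-odds
p <- sigmoid(z)           # P(target = 1 | x)
log(p / (1 - p))          # recovers z, confirming the log-odds form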
We are going to implement our first model using the train function from the caret package in R. Logistic regression is done through the generalized linear model method glm. Note that we need to set the parameter family = "binomial" since our target variable has two output classes, '0' and '1'.
##logistic regression
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:caret':
##
## RMSE
## The following object is masked from 'package:base':
##
## Recall
library(stats)
glm.model <- train(target ~ ., data = Training_Data, method = "glm", family = "binomial")
summary(glm.model)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.444 -1.183 1.033 1.167 1.727
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.1965 9.2464 -0.670 0.5028
## peubah1 26.5443 40.2072 0.660 0.5091
## peubah2 13.3293 12.6343 1.055 0.2914
## peubah3 117.8905 66.3793 1.776 0.0757 .
## peubah4 -11.1402 12.8283 -0.868 0.3852
## peubah5 19.1890 37.6330 0.510 0.6101
## peubah6 0.5534 3.6806 0.150 0.8805
## peubah7 -39.0299 15.9341 -2.449 0.0143 *
## peubah8 49.9100 21.5697 2.314 0.0207 *
## peubah9 -2.1985 4.5045 -0.488 0.6255
## peubah10 -29.5315 18.6817 -1.581 0.1139
## peubah11 -28.9118 128.8852 -0.224 0.8225
## peubah12 35.9269 49.1800 0.731 0.4651
## peubah13 -52.4888 104.9674 -0.500 0.6170
## peubah14 55.7961 38.5336 1.448 0.1476
## peubah15 -19.8469 16.7281 -1.186 0.2354
## peubah16 26.6816 19.8542 1.344 0.1790
## peubah17 -112.3319 61.1337 -1.837 0.0661 .
## peubah18 -66.3396 64.8757 -1.023 0.3065
## peubah19 22.6035 11.5529 1.957 0.0504 .
## peubah20 -5.1605 24.6152 -0.210 0.8339
## peubah21 11.1129 11.5298 0.964 0.3351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 30055 on 21680 degrees of freedom
## Residual deviance: 29972 on 21659 degrees of freedom
## AIC: 30016
##
## Number of Fisher Scoring iterations: 3
Looking at the result, we can see that no particular input variable is statistically significant as a predictor; all p-values are larger than 0.01. In other words, there is no strong association between any particular input variable and the probability of the output target being in class 1.
Now we can run the anova() function on the model to analyze the table of deviance:
anova(glm.model$finalModel, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: .outcome
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 21680 30055
## peubah1 1 2.1402 21679 30053 0.1434813
## peubah2 1 13.4638 21678 30039 0.0002432 ***
## peubah3 1 0.2570 21677 30039 0.6121832
## peubah4 1 4.4466 21676 30035 0.0349709 *
## peubah5 1 5.4259 21675 30029 0.0198404 *
## peubah6 1 6.1914 21674 30023 0.0128372 *
## peubah7 1 8.9440 21673 30014 0.0027838 **
## peubah8 1 10.5703 21672 30004 0.0011492 **
## peubah9 1 0.0823 21671 30003 0.7741746
## peubah10 1 5.9096 21670 29998 0.0150583 *
## peubah11 1 4.8102 21669 29993 0.0282919 *
## peubah12 1 4.1883 21668 29989 0.0407043 *
## peubah13 1 0.0093 21667 29989 0.9232974
## peubah14 1 0.3093 21666 29988 0.5781296
## peubah15 1 0.0974 21665 29988 0.7549172
## peubah16 1 4.5919 21664 29984 0.0321225 *
## peubah17 1 6.6682 21663 29977 0.0098146 **
## peubah18 1 0.5798 21662 29976 0.4463804
## peubah19 1 3.2008 21661 29973 0.0736029 .
## peubah20 1 0.1733 21660 29973 0.6771516
## peubah21 1 0.9287 21659 29972 0.3351983
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. We can see that peubah2 contributes relatively the most to reducing the residual deviance.
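As a complementary overall check (a sketch using the glm.model fitted above), the gap between the null and residual deviance can be tested directly with a chi-square likelihood-ratio test:
# Sketch: likelihood-ratio test of the full model against the intercept-only model.
# A small p-value indicates the predictors jointly improve on the null model.
fit <- glm.model$finalModel
with(fit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))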
Let’s calculate the negative log-loss for our first model as below. Here we call the LogLoss function from the MLmetrics package, passing target as the “true values” (y_true) and the model’s fitted values as the “prediction values” (y_pred). This yields a negative log-loss of 0.69, which is interesting; we will discuss it further in the final section of this report.
(glm.ll_train <- LogLoss(y_true = as.numeric(Training_Data$target)-1,
y_pred = glm.model$finalModel$fitted.values))
## [1] 0.6912029
In the steps above we briefly evaluated the model fit and its accuracy using the negative log-loss function; now we would like to see how the model does when predicting the output target variable. We can only evaluate the misclassification rate on Training_Data, since Test_Data does not have a target column. By setting the parameter type = "response", R outputs probabilities in the form \(P(y=1|X)\). Our decision boundary will be 0.5: if \(P(y=1|X) > 0.5\) then \(y = 1\), otherwise \(y = 0\). Since our exploratory data analysis above showed that the number of observations with target = 1 is almost the same as the number with target = 0, a decision boundary of 0.5 seems a good fit.
glm.pred_train <- predict(glm.model$finalModel, Training_Data, type = "response")
glm.pred_train <- ifelse(glm.pred_train > 0.5, 1, 0)
(MissClassifError <- mean(glm.pred_train != Training_Data$target))
## [1] 0.4742862
The above result shows a misclassification error of about 47%, or a prediction accuracy of about 53%. This level of accuracy is expected given our exploratory data analysis above, which showed similar numbers of 0 and 1 output classes and nearly identically distributed input variables.
We can then apply this model to predict the probability of target = 1 for Test_Data and examine its distribution. The result is shown below:
glm.pred_test <- predict(glm.model$finalModel, Test_Data, type = "response")
summary(glm.pred_test)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1594 0.4868 0.5045 0.5038 0.5218 0.7050
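At a 0.5 cut-off, this distribution implies that roughly half of the test observations would be labelled as class 1; a quick check using the predictions above:
# Share of test observations that would be classified as 1 at the 0.5 cut-off.
mean(glm.pred_test > 0.5)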
We are also curious to see whether we can adjust our first model further. Since the data has many predictors (21 input variables) and very few of them are statistically significant, we would like to see if we can “simplify” the model without losing much on the log-loss metric, and hence on prediction accuracy.
Here we introduce pre-processing of the data with PCA (principal component analysis) by passing the preProcess = "pca" parameter to our logistic regression model as below:
glm2.model <- train(target ~ ., data = Training_Data, method = "glm",
family = "binomial", preProcess = "pca")
summary(glm2.model)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.343 -1.179 1.083 1.174 1.264
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.015624 0.013591 1.150 0.250
## PC1 0.005344 0.003909 1.367 0.172
## PC2 -0.009849 0.006402 -1.538 0.124
## PC3 0.031700 0.007287 4.350 1.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 30055 on 21680 degrees of freedom
## Residual deviance: 30032 on 21677 degrees of freedom
## AIC: 30040
##
## Number of Fisher Scoring iterations: 3
As shown below, 3 principal components are used to capture 95% of the variance of all input variables. The adjusted logistic regression model’s analysis of deviance over the PCA components, using the anova() function, is also shown below.
glm2.model$preProcess
## Created from 21681 samples and 21 variables
##
## Pre-processing:
## - centered (21)
## - ignored (0)
## - principal component signal extraction (21)
## - scaled (21)
##
## PCA needed 3 components to capture 95 percent of the variance
anova(glm2.model$finalModel, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: .outcome
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 21680 30055
## PC1 1 1.8661 21679 30053 0.1719
## PC2 1 2.3622 21678 30051 0.1243
## PC3 1 18.9621 21677 30032 1.333e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We evaluate the negative log-loss for our adjusted model and obtain 0.69, very close to our first model. This suggests that the adjusted model’s classification performance is essentially the same as that of the first model.
(glm2.ll_train <- LogLoss(y_true = as.numeric(Training_Data$target)-1,
y_pred = glm2.model$finalModel$fitted.values))
## [1] 0.692582
glm2.pred_train <- predict(glm2.model, Training_Data, type = "prob")[,2]
Let’s check the adjusted model’s accuracy using the same decision boundary of 0.5. Here we see a slight reduction in accuracy, from about 53% to about 51%.
glm2.pred_train <- ifelse(glm2.pred_train > 0.5, 1, 0)
(MissClassifError <- mean(glm2.pred_train != Training_Data$target))
## [1] 0.4862783
Again, if we apply the adjusted model to predict the output for the Test_Data dataset, we obtain a probability distribution similar to that of our initial model.
glm2.pred_test <- predict(glm2.model, Test_Data, type = "prob")[,2]
summary(glm.pred_test)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1594 0.4868 0.5045 0.5038 0.5218 0.7050
An LDA model can be used for supervised (labelled) training data, which is the case for our Training_Data dataset. Both Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are linear transformation techniques commonly used for dimensionality reduction. PCA’s goal is to find the directions (the so-called principal components) that maximize the variance in a dataset. In contrast, LDA computes the directions (“linear discriminants”) that represent the axes that maximize the separation between the classes of our output variable.
We implement the LDA model as below:
lda.model <- train(target ~ ., data = Training_Data, method = "lda")
## Loading required package: MASS
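For a quick look at what LDA has learned (a sketch; finalModel here is the underlying MASS::lda fit), we can inspect the coefficients of the single linear discriminant:
# Sketch: coefficients of the linear discriminant found by MASS::lda,
# i.e. the direction that best separates the '0' and '1' classes.
lda.model$finalModel$scaling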
The fitted values (predictions) of the Training_Data output target and the log-loss are computed as below, which again yields a negative log-loss of 0.69, similar to our previous two models.
lda.pred_train <- predict(lda.model, Training_Data, type = "prob")[,2]
(lda.ll_train <- LogLoss(y_true = as.numeric(Training_Data$target)-1,
y_pred = lda.pred_train))
## [1] 0.6912032
The LDA model’s accuracy is again about 53%, suggesting prediction performance similar to the previous two models.
lda.pred_train <- ifelse(lda.pred_train > 0.5, 1, 0)
(MissClassifError <- mean(lda.pred_train != Training_Data$target))
## [1] 0.474194
Now let’s look at the predicted values for Test_Data. This also results in a probability distribution similar to our previous models.
lda.pred_test <- predict(lda.model, Test_Data, type = "prob")[,2]
summary(glm.pred_test)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1594 0.4868 0.5045 0.5038 0.5218 0.7050
An important observation from our exploratory data analysis of Training_Data was the near-identical (approximately i.i.d.) distributions of the input variables, together with the observation that the probabilities of outcomes 0 and 1 are almost equal at about 0.5. This suggests a naive Bayes-like independence for the input variables, so we can multiply the predicted output probabilities to obtain the likelihood of the data. Taking the negative logarithm of this likelihood and averaging across all \(N\) samples gives our negative log-loss function.
We also see that an output probability of about 0.5 for both 1 and 0 corresponds to maximum entropy of the target. Under this condition, the minimum achievable negative log-loss is about 0.69, which is obtained simply by substituting \(p_i = 0.5\) into our binary log-loss expression above, giving \(-\log(0.5) \approx 0.693\) for every observation.
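A quick numerical check (a sketch using the training labels) confirms this floor:
# Sketch: the best constant prediction is the empirical class frequency;
# with a roughly 50/50 split, the achievable log-loss floor is about 0.693.
p1 <- mean(Training_Data$target == "1")
-(p1 * log(p1) + (1 - p1) * log(1 - p1))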
So, to conclude our analysis:
1. The input variables in Training_Data are almost i.i.d.
2. The probability of output 1 is about 0.5, and hence the probability of output 0 is also about 0.5.
3. The negative log-loss obtained on this Training_Data set is almost the same across the different models, at about 0.69. We can approximate this value by substituting \(p_i = 0.5\) into our binary log-loss function.