DIVACTORY 01 Report By Djoko Soehartono and M. Farhan Rashid

Executive Summary

In this DIVACTORY 01 Warm Up case study, we are given two CSV files, Training_Data.csv and Test_Data.csv, containing the training and test data respectively. The objectives of this case study are:

  1. Develop an exploratory data analysis based on the given training and test datasets

  2. Develop a model to minimize the negative log-loss function

  3. Document the model results in a report and slides

No further information was shared by the competition host about the training and test datasets. We will explore the data and try to gain a better understanding of it in our exploratory data analysis.

Data Processing and Exploratory Data Analysis

The first step is to extract the training and test datasets into Training_Data and Test_Data data frames as below.

Training_Data <- read.csv("Training_Data.csv", header = TRUE, sep = ",")
Test_Data <- read.csv("Test_Data.csv", header = TRUE, sep = ",")

Let’s look at the datasets with the R str() function. We see that Training_Data consists of 22 columns: 21 input variables and one output variable, target. Test_Data consists of only the 21 input columns, without the target output column. Training_Data has 21681 rows and Test_Data has 86724 rows.

Since the target column has only two output classes, 0 and 1, we convert it to a factor with the as.factor() function.

str(Training_Data)
## 'data.frame':    21681 obs. of  22 variables:
##  $ peubah1 : num  0.416 0.393 0.412 0.416 0.431 ...
##  $ peubah2 : num  0.434 0.461 0.464 0.434 0.461 ...
##  $ peubah3 : num  0.48 0.453 0.418 0.483 0.419 ...
##  $ peubah4 : num  0.391 0.405 0.455 0.358 0.45 ...
##  $ peubah5 : num  0.312 0.377 0.415 0.308 0.409 ...
##  $ peubah6 : num  0.35 0.393 0.418 0.342 0.45 ...
##  $ peubah7 : num  0.48 0.407 0.336 0.501 0.371 ...
##  $ peubah8 : num  0.367 0.409 0.397 0.376 0.372 ...
##  $ peubah9 : num  0.297 0.256 0.359 0.271 0.324 ...
##  $ peubah10: num  0.408 0.491 0.497 0.422 0.458 ...
##  $ peubah11: num  0.328 0.368 0.404 0.319 0.404 ...
##  $ peubah12: num  0.377 0.429 0.452 0.367 0.438 ...
##  $ peubah13: num  0.346 0.358 0.396 0.339 0.404 ...
##  $ peubah14: num  0.519 0.479 0.413 0.538 0.423 ...
##  $ peubah15: num  0.462 0.447 0.368 0.476 0.394 ...
##  $ peubah16: num  0.352 0.335 0.389 0.342 0.388 ...
##  $ peubah17: num  0.455 0.446 0.415 0.458 0.408 ...
##  $ peubah18: num  0.481 0.452 0.413 0.491 0.412 ...
##  $ peubah19: num  0.442 0.422 0.359 0.455 0.391 ...
##  $ peubah20: num  0.382 0.376 0.373 0.392 0.367 ...
##  $ peubah21: num  0.361 0.305 0.299 0.37 0.322 ...
##  $ target  : int  1 0 0 1 1 1 0 1 1 0 ...
str(Test_Data)
## 'data.frame':    86724 obs. of  21 variables:
##  $ peubah1 : num  0.401 0.471 0.441 0.426 0.45 ...
##  $ peubah2 : num  0.444 0.442 0.324 0.551 0.343 ...
##  $ peubah3 : num  0.465 0.405 0.484 0.388 0.416 ...
##  $ peubah4 : num  0.376 0.428 0.309 0.49 0.387 ...
##  $ peubah5 : num  0.348 0.408 0.315 0.428 0.421 ...
##  $ peubah6 : num  0.352 0.422 0.208 0.439 0.381 ...
##  $ peubah7 : num  0.439 0.408 0.468 0.393 0.298 ...
##  $ peubah8 : num  0.408 0.337 0.373 0.373 0.399 ...
##  $ peubah9 : num  0.285 0.355 0.422 0.218 0.493 ...
##  $ peubah10: num  0.468 0.422 0.372 0.538 0.409 ...
##  $ peubah11: num  0.342 0.408 0.309 0.438 0.383 ...
##  $ peubah12: num  0.396 0.425 0.333 0.508 0.373 ...
##  $ peubah13: num  0.348 0.425 0.344 0.418 0.409 ...
##  $ peubah14: num  0.497 0.427 0.491 0.457 0.336 ...
##  $ peubah15: num  0.432 0.391 0.383 0.471 0.2 ...
##  $ peubah16: num  0.349 0.41 0.352 0.393 0.423 ...
##  $ peubah17: num  0.451 0.387 0.46 0.385 0.409 ...
##  $ peubah18: num  0.472 0.405 0.496 0.383 0.423 ...
##  $ peubah19: num  0.406 0.413 0.43 0.418 0.274 ...
##  $ peubah20: num  0.398 0.371 0.452 0.289 0.471 ...
##  $ peubah21: num  0.343 0.363 0.46 0.193 0.453 ...
Training_Data$target <- as.factor(Training_Data$target)

Let’s see the range of values for all input and output variables of Training_Data as below:

summary(Training_Data)
##     peubah1          peubah2          peubah3           peubah4      
##  Min.   :0.0000   Min.   :0.0265   Min.   :0.08678   Min.   :0.0000  
##  1st Qu.:0.3758   1st Qu.:0.3918   1st Qu.:0.43228   1st Qu.:0.3426  
##  Median :0.4052   Median :0.4346   Median :0.46691   Median :0.3827  
##  Mean   :0.4041   Mean   :0.4285   Mean   :0.46388   Mean   :0.3785  
##  3rd Qu.:0.4363   3rd Qu.:0.4720   3rd Qu.:0.49945   3rd Qu.:0.4191  
##  Max.   :0.6808   Max.   :0.6661   Max.   :0.68957   Max.   :0.6288  
##     peubah5           peubah6           peubah7           peubah8       
##  Min.   :0.01573   Min.   :0.06821   Min.   :0.04294   Min.   :0.06272  
##  1st Qu.:0.31949   1st Qu.:0.36199   1st Qu.:0.39151   1st Qu.:0.36220  
##  Median :0.35690   Median :0.40701   Median :0.43157   Median :0.39411  
##  Mean   :0.35554   Mean   :0.40325   Mean   :0.42613   Mean   :0.39331  
##  3rd Qu.:0.39357   3rd Qu.:0.44870   3rd Qu.:0.46296   3rd Qu.:0.42481  
##  Max.   :0.59906   Max.   :0.69315   Max.   :0.66133   Max.   :0.69315  
##     peubah9            peubah10         peubah11       
##  Min.   :0.005346   Min.   :0.0000   Min.   :0.001948  
##  1st Qu.:0.242413   1st Qu.:0.4130   1st Qu.:0.308727  
##  Median :0.283388   Median :0.4519   Median :0.347334  
##  Mean   :0.288719   Mean   :0.4461   Mean   :0.345317  
##  3rd Qu.:0.330045   3rd Qu.:0.4859   3rd Qu.:0.384664  
##  Max.   :0.679363   Max.   :0.6655   Max.   :0.630127  
##     peubah12            peubah13           peubah14         peubah15     
##  Min.   :0.0007097   Min.   :0.007948   Min.   :0.1275   Min.   :0.0646  
##  1st Qu.:0.3475252   1st Qu.:0.321467   1st Qu.:0.4516   1st Qu.:0.3957  
##  Median :0.3905207   Median :0.354438   Median :0.4861   Median :0.4352  
##  Mean   :0.3862154   Mean   :0.353111   Mean   :0.4830   Mean   :0.4297  
##  3rd Qu.:0.4291426   3rd Qu.:0.387220   3rd Qu.:0.5171   3rd Qu.:0.4688  
##  Max.   :0.6603450   Max.   :0.651168   Max.   :0.6929   Max.   :0.6636  
##     peubah16          peubah17          peubah18         peubah19       
##  Min.   :0.01702   Min.   :0.06889   Min.   :0.1159   Min.   :0.001149  
##  1st Qu.:0.30723   1st Qu.:0.42082   1st Qu.:0.4327   1st Qu.:0.387125  
##  Median :0.34110   Median :0.45101   Median :0.4671   Median :0.425967  
##  Mean   :0.34021   Mean   :0.44853   Mean   :0.4646   Mean   :0.420817  
##  3rd Qu.:0.37417   3rd Qu.:0.47993   3rd Qu.:0.4996   3rd Qu.:0.456107  
##  Max.   :0.62958   Max.   :0.69082   Max.   :0.6912   Max.   :0.659270  
##     peubah20          peubah21       target   
##  Min.   :0.06297   Min.   :0.03861   0:10756  
##  1st Qu.:0.36750   1st Qu.:0.31216   1:10925  
##  Median :0.39835   Median :0.35353            
##  Mean   :0.39924   Mean   :0.35739            
##  3rd Qu.:0.43211   3rd Qu.:0.40094            
##  Max.   :0.65263   Max.   :0.69315

At a glance, all input variables in Training_Data appear to have similar ranges, with minimum values at or near 0 and maximum values between 0.6 and 0.7. Almost all the medians lie between 0.3 and 0.5.

Now let’s look at the scatterplot matrix of the variables in Training_Data. A complete matrix would have 22 by 22 scatterplot panels, which would be difficult to read in this report, so we split the scatterplot matrix into the three plots below. Although these plots are not exhaustive, they still give a good sense of how Training_Data is distributed across its variable pairs.

We see that the data is scattered differently across the variable pairs. The green and blue dots in each panel represent data points for output classes 0 and 1 respectively. Looking at each panel, no single variable clearly “separates” the green and blue dots.

library(graphics)
pairs(~ peubah1+peubah2+peubah3+peubah4+peubah5+peubah6+peubah7+target, 
      data = Training_Data, main = "peubah1 to peubah7 + target", 
      col = 3 + (Training_Data$target == 1), upper.panel = NULL)

pairs(~ peubah8+peubah9+peubah10+peubah11+peubah12+peubah13+peubah14+target, 
      data = Training_Data, main = "peubah8 to peubah14 + target", 
      col = 3 + (Training_Data$target == 1), upper.panel = NULL)

pairs(~ peubah15+peubah16+peubah17+peubah18+peubah19+peubah20+peubah21+target, 
      data = Training_Data, main = "peubah15 to peubah21 + target", 
      col = 3 + (Training_Data$target == 1), upper.panel = NULL)

Let us also look at Training_Data from another angle: the cross-correlations between its input variables. The correlation plot is shown below.

Apart from the diagonal, which is the self-correlation of each variable (a perfect correlation of 1.0), we also see high cross-correlations as dark blue and dark red squares, i.e. values near +1.0 and -1.0 respectively. This implies a fair amount of collinearity between the input variables, which suggests that dimension reduction, for example principal component analysis (PCA), could be applied to the data. We will return to this during our prediction modelling.

library(corrplot)
corrplot(cor(Training_Data[,-22]), type = "lower", method = "square")
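
As a quick, illustrative check of this collinearity (not part of the modelling pipeline below), we can look at how much of the total variance a few principal components would capture:

# Illustrative check: proportion of variance captured by the leading principal components
pca_check <- prcomp(Training_Data[ , -22], center = TRUE, scale. = TRUE)
summary(pca_check)$importance[ , 1:5]  # standard deviation, proportion and cumulative variance of the first 5 components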

Given our exploratory data analysis above, it is also interesting to see how the data is distributed for each variable. Those are shown in two plots below.

Visually, the distribution shapes of the input variables appear almost identical across the plot panels.

library(purrr)
library(tidyr)
library(ggplot2)

Training_Data[ ,-22] %>%
  keep(is.numeric) %>%
  gather %>%
  ggplot(aes(value)) +
  facet_wrap(~key, scales = "free", nrow = 3, ncol = 7) +
  geom_histogram(alpha = .6, fill = "navy", binwidth = 0.01, aes(y = ..density..))

g <- ggplot(data = Training_Data, aes(x = target))
g + geom_bar(fill = "navy", alpha = .6) + ggtitle("Count of '0' and '1' in target")

Overview of Log-Loss Function

Logarithmic Loss, or simply Log Loss, is a classification loss function often used as an evaluation metric for classifiers. Before we start building a model, it makes sense to have some understanding of how this metric is calculated and how it should be interpreted.

Log Loss quantifies the quality of a classifier by penalising confident but wrong predictions. For example, predicting probability 0.1 for an observation whose true class is 1 contributes \(-\log(0.1) \approx 2.30\) to the loss, whereas predicting 0.9 contributes only \(-\log(0.9) \approx 0.11\). Minimising the Log Loss therefore tends to go hand in hand with maximising the accuracy of the classifier, although the two metrics are not identical.

In order to calculate Log Loss, the classifier must assign a probability to each class rather than simply yielding the most likely class. The multi-class version of the Logarithmic Loss metric is shown below. Each observation belongs to exactly one class, and for each observation we submit a predicted probability for every class. The metric is then the negative log-likelihood of the model under which each test observation is drawn independently from a distribution that places the submitted probability mass on the corresponding classes.

\[logloss = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^M y_{i,j} \log (p_{i,j})\] where \(N\) is the number of observations, \(M\) is the number of class labels, \(log\) is the natural logarithm, \(y_{i,j}\) is 1 if observation \(i\) is in class \(j\) and 0 otherwise, and \(p_{i,j}\) is the predicted probability that observation \(i\) is in class \(j\).

If there are only two classes then the expression above simplifies to

\[logloss = - \frac{1}{N} \sum_{i=1}^{N} [y_i \log p_i + (1-y_i) \log (1 - p_i)]\] Note that for each instance only the term for the correct class actually contributes to the sum.
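
As a small illustration, the binary log-loss can be computed directly in R. The helper below is a minimal sketch written for this report (not a library function), with predicted probabilities clipped away from 0 and 1 to avoid log(0):

# Minimal sketch of the binary log-loss (illustrative helper, not a library function)
binary_logloss <- function(y_true, p_pred, eps = 1e-15) {
  p <- pmin(pmax(p_pred, eps), 1 - eps)  # clip probabilities away from 0 and 1
  -mean(y_true * log(p) + (1 - y_true) * log(1 - p))
}
# A constant prediction of 0.5 for every observation gives -log(0.5), roughly 0.693
binary_logloss(y_true = c(1, 0, 1, 1), p_pred = rep(0.5, 4))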

Developing the Model to Minimize Negative Log-Loss Function

Logistic Regression

The first model we would like to assess is logistic regression, a log-linear model, since it is closely related to the log-loss function.

A log-linear model assumes that the log-odds of the conditional probability of the target given the features is a weighted linear combination of the features. These weights are the parameters of the model that we would like to estimate.
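
Written out for our data, with the input variables peubah1 to peubah21 denoted \(x_1, \ldots, x_{21}\) and weights \(\beta_0, \beta_1, \ldots, \beta_{21}\), the model is

\[\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_{21} x_{21}\] or equivalently \(P(y = 1 \mid x) = 1 / (1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_{21} x_{21})})\).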

If we assume the samples are independent, our objective is to maximize the product of the prediction probabilities of the data, or equivalently to minimize the sum of the negative logarithms of the prediction probabilities. The negative log prediction probability of the log-linear model, averaged over the samples, is exactly the log-loss.
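
Concretely, if \(p_i\) denotes the model's predicted probability that observation \(i\) has \(y_i = 1\), the likelihood of independent samples is \(\prod_{i=1}^{N} p_i^{y_i} (1-p_i)^{1-y_i}\), and minimizing its negative logarithm

\[- \log L = - \sum_{i=1}^{N} \left[ y_i \log p_i + (1-y_i) \log (1-p_i) \right]\] is, up to the factor \(1/N\), the same as minimizing the binary log-loss defined earlier.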

You may wonder why log-linear models are used in the first place. It turns out that log-linear models are the least biased models that preserve the empirical feature probabilities. Intuitively, a reasonable model for an unknown binary probability distribution given some data is the maximum entropy distribution whose feature expectations agree with the empirical feature expectations in the data. It can be shown that this leads to the exponential family of distributions, of which log-linear models are the simplest instance.

We implement our first model using the train function from the caret package in R. Logistic regression is fitted through the generalized linear model method glm. Note that we need to set the parameter family = "binomial" since our target variable has two output classes, '0' and '1'.

# Logistic regression
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:caret':
## 
##     RMSE
## The following object is masked from 'package:base':
## 
##     Recall
library(stats)

glm.model <- train(target ~ ., data = Training_Data, method = "glm", family = "binomial")
summary(glm.model)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.444  -1.183   1.033   1.167   1.727  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -6.1965     9.2464  -0.670   0.5028  
## peubah1       26.5443    40.2072   0.660   0.5091  
## peubah2       13.3293    12.6343   1.055   0.2914  
## peubah3      117.8905    66.3793   1.776   0.0757 .
## peubah4      -11.1402    12.8283  -0.868   0.3852  
## peubah5       19.1890    37.6330   0.510   0.6101  
## peubah6        0.5534     3.6806   0.150   0.8805  
## peubah7      -39.0299    15.9341  -2.449   0.0143 *
## peubah8       49.9100    21.5697   2.314   0.0207 *
## peubah9       -2.1985     4.5045  -0.488   0.6255  
## peubah10     -29.5315    18.6817  -1.581   0.1139  
## peubah11     -28.9118   128.8852  -0.224   0.8225  
## peubah12      35.9269    49.1800   0.731   0.4651  
## peubah13     -52.4888   104.9674  -0.500   0.6170  
## peubah14      55.7961    38.5336   1.448   0.1476  
## peubah15     -19.8469    16.7281  -1.186   0.2354  
## peubah16      26.6816    19.8542   1.344   0.1790  
## peubah17    -112.3319    61.1337  -1.837   0.0661 .
## peubah18     -66.3396    64.8757  -1.023   0.3065  
## peubah19      22.6035    11.5529   1.957   0.0504 .
## peubah20      -5.1605    24.6152  -0.210   0.8339  
## peubah21      11.1129    11.5298   0.964   0.3351  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 30055  on 21680  degrees of freedom
## Residual deviance: 29972  on 21659  degrees of freedom
## AIC: 30016
## 
## Number of Fisher Scoring iterations: 3

Looking at the result, no individual input variable is statistically significant as a predictor at the 1% level: all p-values are larger than 0.01. In other words, there is no strong association between any single input variable and the probability of the output target being in class 1.

Now we can run the anova() function on the model to analyze the table of deviance:

anova(glm.model$finalModel, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: .outcome
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                     21680      30055              
## peubah1   1   2.1402     21679      30053 0.1434813    
## peubah2   1  13.4638     21678      30039 0.0002432 ***
## peubah3   1   0.2570     21677      30039 0.6121832    
## peubah4   1   4.4466     21676      30035 0.0349709 *  
## peubah5   1   5.4259     21675      30029 0.0198404 *  
## peubah6   1   6.1914     21674      30023 0.0128372 *  
## peubah7   1   8.9440     21673      30014 0.0027838 ** 
## peubah8   1  10.5703     21672      30004 0.0011492 ** 
## peubah9   1   0.0823     21671      30003 0.7741746    
## peubah10  1   5.9096     21670      29998 0.0150583 *  
## peubah11  1   4.8102     21669      29993 0.0282919 *  
## peubah12  1   4.1883     21668      29989 0.0407043 *  
## peubah13  1   0.0093     21667      29989 0.9232974    
## peubah14  1   0.3093     21666      29988 0.5781296    
## peubah15  1   0.0974     21665      29988 0.7549172    
## peubah16  1   4.5919     21664      29984 0.0321225 *  
## peubah17  1   6.6682     21663      29977 0.0098146 ** 
## peubah18  1   0.5798     21662      29976 0.4463804    
## peubah19  1   3.2008     21661      29973 0.0736029 .  
## peubah20  1   0.1733     21660      29973 0.6771516    
## peubah21  1   0.9287     21659      29972 0.3351983    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept); the wider this gap, the better. We can see that peubah2 contributes the most to reducing the residual deviance.
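
As an additional, illustrative check not in the original analysis, the overall drop in deviance from the null model to the full model can be compared against a chi-squared distribution with 21 degrees of freedom:

# Illustrative: test the overall drop in deviance (30055 - 29972 on 21 degrees of freedom);
# a small p-value indicates the full model fits significantly better than the intercept-only model
pchisq(30055 - 29972, df = 21, lower.tail = FALSE)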

Let’s calculate the negative log-loss for our first model as below. Here we call the LogLoss function from the MLmetrics package, passing the target column as the true values (y_true) and the model’s fitted values as the predicted values (y_pred). This yields a negative log-loss value of about 0.69, which is interesting and which we will discuss further in the final section of this report.

(glm.ll_train <- LogLoss(y_true = as.numeric(Training_Data$target)-1, 
                               y_pred = glm.model$finalModel$fitted.values))
## [1] 0.6912029

In the steps above we briefly evaluated the fit of the model and its accuracy using the negative log-loss function; now we would like to see how the model performs when predicting the output target variable. We can only evaluate the misclassification rate on Training_Data, since Test_Data does not have a target column. By setting the parameter type = "response", R outputs probabilities in the form \(P(y=1|X)\). Our decision boundary is 0.5: if \(P(y=1|X)\) > 0.5 then \(y\) = 1, otherwise \(y\) = 0. Since our exploratory data analysis showed that the number of observations with target = 1 is almost the same as the number with target = 0, a decision boundary of 0.5 seems a reasonable choice.

glm.pred_train <- predict(glm.model$finalModel, Training_Data, type = "response")
glm.pred_train <- ifelse(glm.pred_train > 0.5, 1, 0)
(MissClassifError <- mean(glm.pred_train != Training_Data$target))
## [1] 0.4742862

The result above shows a misclassification error of about 47%, or a prediction accuracy of about 53%. This level of accuracy is in line with our exploratory data analysis, which showed a similar number of 0 and 1 output classes and nearly identical distributions across the input variables.
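
For a fuller picture than a single error rate, we could also cross-tabulate predicted against actual classes with caret's confusionMatrix function (an optional check, output not shown here):

# Optional check: confusion matrix of predicted vs. actual classes on the training set
confusionMatrix(data = factor(glm.pred_train, levels = c(0, 1)),
                reference = Training_Data$target)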

We can then apply this model to predict the probability that target = 1 for Test_Data and look at the distribution of these predictions. The result is shown below:

glm.pred_test <- predict(glm.model$finalModel, Test_Data, type = "response")
summary(glm.pred_test)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1594  0.4868  0.5045  0.5038  0.5218  0.7050

Logistic Regression with Pre-processing - Principal Component Analysis

We are also curious to see whether we can refine our first model. Since the data has many predictors (21 input variables) and only a few of them are statistically significant, we would like to see if we can “simplify” the model without losing much on the log-loss metric, and hence on prediction accuracy.

Here we introduce pre-processing of the data with principal component analysis (PCA) by passing the preProcess = "pca" parameter to our logistic regression model, as below:

glm2.model <- train(target ~ ., data = Training_Data, method = "glm",
                    family = "binomial", preProcess = "pca")
summary(glm2.model)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.343  -1.179   1.083   1.174   1.264  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.015624   0.013591   1.150    0.250    
## PC1          0.005344   0.003909   1.367    0.172    
## PC2         -0.009849   0.006402  -1.538    0.124    
## PC3          0.031700   0.007287   4.350 1.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 30055  on 21680  degrees of freedom
## Residual deviance: 30032  on 21677  degrees of freedom
## AIC: 30040
## 
## Number of Fisher Scoring iterations: 3

As shown below, 3 principal components are used to capture 95% of the variance of all input variables. Also shown below is the analysis of deviance for the adjusted logistic regression model, using the anova() function over the principal components.

glm2.model$preProcess
## Created from 21681 samples and 21 variables
## 
## Pre-processing:
##   - centered (21)
##   - ignored (0)
##   - principal component signal extraction (21)
##   - scaled (21)
## 
## PCA needed 3 components to capture 95 percent of the variance
anova(glm2.model$finalModel, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: .outcome
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                 21680      30055              
## PC1   1   1.8661     21679      30053    0.1719    
## PC2   1   2.3622     21678      30051    0.1243    
## PC3   1  18.9621     21677      30032 1.333e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We evaluate the negative log-loss for the adjusted model and obtain about 0.69, very close to our first model. This suggests that the adjusted model's classification performance is essentially the same as that of the first model.

(glm2.ll_train <- LogLoss(y_true = as.numeric(Training_Data$target)-1,
                         y_pred = glm2.model$finalModel$fitted.values))
## [1] 0.692582
glm2.pred_train <- predict(glm2.model, Training_Data, type = "prob")[,2]

Let’s check the adjusted model’s accuracy using the same decision boundary of 0.5. Here we see a slight reduction in accuracy, from about 53% to about 51%.

glm2.pred_train <- ifelse(glm2.pred_train > 0.5, 1, 0)
(MissClassifError <- mean(glm2.pred_train != Training_Data$target))
## [1] 0.4862783

Again, applying the adjusted model to predict the output for the Test_Data dataset gives a probability distribution similar to that of our initial model.

glm2.pred_test <- predict(glm2.model, Test_Data, type = "prob")[,2]
summary(glm.pred_test)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1594  0.4868  0.5045  0.5038  0.5218  0.7050
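
Earlier we saw that three principal components were enough to capture 95% of the variance. If we wanted the PCA pre-processing to retain more variance, caret allows the threshold to be raised through trainControl; the sketch below (not run for this report) would keep components up to 99% of the variance:

# Sketch (not run for this report): raise the PCA variance threshold from 95% to 99%
glm2b.model <- train(target ~ ., data = Training_Data, method = "glm",
                     family = "binomial", preProcess = "pca",
                     trControl = trainControl(preProcOptions = list(thresh = 0.99)))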

Linear Discriminant Analysis (LDA) Model

LDA can be used for supervised learning, which is our situation with the Training_Data dataset. Both Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are linear transformation techniques commonly used for dimensionality reduction. PCA’s goal is to find the directions (the so-called principal components) that maximize the variance in a dataset. In contrast, LDA computes the directions (“linear discriminants”) that represent the axes that maximize the separation between the classes of our output variable.

We implement LDA model as below:

lda.model <- train(target ~ ., data = Training_Data, method = "lda")
## Loading required package: MASS

The fitted values (predictions) of the Training_Data output target and the log-loss are computed below, which again yields a negative log-loss value of about 0.69. This is similar to our previous two models.

lda.pred_train <- predict(lda.model, Training_Data, type = "prob")[,2]
(lda.ll_train <- LogLoss(y_true = as.numeric(Training_Data$target)-1,
                        y_pred = lda.pred_train))
## [1] 0.6912032

The LDA model's accuracy is again about 53%, suggesting prediction performance similar to the previous two models.

lda.pred_train <- ifelse(lda.pred_train > 0.5, 1, 0)
(MissClassifError <- mean(lda.pred_train != Training_Data$target))
## [1] 0.474194

Now let’s look at the prediction values for Test_Data. This also results in a probability distribution similar to those of our previous models.

lda.pred_test <- predict(lda.model, Test_Data, type = "prob")[,2]
summary(glm.pred_test)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1594  0.4868  0.5045  0.5038  0.5218  0.7050

Analysis and Conclusion

An important observation from our exploratory data analysis of Training_Data was that the input variables appeared close to i.i.d. when we plotted their distributions, and that the probabilities of outcomes 0 and 1 are almost equal, at about 0.5. This suggests that we can treat the samples as independent and simply multiply the predicted output probabilities to obtain the likelihood of our data (in the spirit of a naive Bayes assumption). Taking the negative logarithm of this likelihood and averaging across all \(N\) data samples gives our negative log-loss function.

We also see that an output probability of about 0.5 for class 1 (and for class 0) corresponds to maximum entropy in the data. Under this condition, the minimum negative log-loss we can achieve is about 0.69. This value is obtained simply by substituting \(p_i\) = 0.5 into our binary log-loss function above, giving \(-\log(0.5) \approx 0.693\).
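
As a quick sanity check of this bound, we can evaluate the same LogLoss function used above with a constant prediction of 0.5 for every training observation:

# Sanity check: a constant prediction of 0.5 for every observation yields -log(0.5) = 0.6931...
LogLoss(y_true = as.numeric(Training_Data$target) - 1,
        y_pred = rep(0.5, nrow(Training_Data)))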

So to conclude our analysis:

  1. Input variables in Training_Data are almost i.i.d

  2. Probability of output 1 is about 0.5, hence probability of output 0 is also about 0.5

  3. The negative log-loss values obtained on this Training_Data set are nearly the same for the different models, at about 0.69. We can approximate this value by substituting \(p_i\) = 0.5 into our binary log-loss function, which gives \(-\log(0.5) \approx 0.693\); a short side-by-side comparison of the three models is given below
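
For reference, the training-set log-loss values of the three models fitted above can be collected side by side:

# Collect the training-set log-loss values of the three models fitted in this report
data.frame(model   = c("glm", "glm + PCA", "lda"),
           logloss = c(glm.ll_train, glm2.ll_train, lda.ll_train))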