1) What is null deviance in the context of Logistic Regression?

The deviance is a concept in generalized linear models that measures how far a fitted model falls from a perfect model known as the saturated model.

In logistic regression the saturated model is the model such that: \(\hat{P}[Y = 1 \mid X_1 = X_{i1}, \ldots, X_p = X_{ip}] = Y_i,\) for \(i = 1, 2, \ldots, n\).

The null deviance is a generalization of the total sum of squares of the linear model: it measures how well a model with only the intercept predicts the response variable. The null deviance therefore serves as a benchmark for judging the scale of the deviance in a logistic regression.
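As a quick illustration, the null deviance that glm() reports is exactly the deviance of an intercept-only fit. Below is a minimal sketch with simulated data; the variable names and coefficients are made up for demonstration.

set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 + 1.2 * x))  # simulated binary response

fit  <- glm(y ~ x, family = binomial())      # fitted model with a predictor
null <- glm(y ~ 1, family = binomial())      # intercept-only model

fit$null.deviance   # the null deviance reported alongside the full fit
deviance(null)      # matches: deviance of the intercept-only model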

2) How can we use null deviance?

The degrees of freedom become very important here: the null deviance has n − 1 degrees of freedom (only the intercept is estimated), while the residual deviance has n minus the number of estimated parameters. The size of the null deviance indicates how much there is to explain; for example, if the null deviance is very small, the intercept-only model already explains the data well and there is little for predictors to add. Comparing the null deviance against the residual deviance, together with the corresponding drop in degrees of freedom, tells us whether additional independent variables are needed and how much the model has improved by adding the predictors, as sketched below.
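Reusing fit and null from the sketch in question 1 (those names are from that hypothetical example), the drop in deviance can be compared to a chi-squared distribution with degrees of freedom equal to the number of added parameters; a small p-value indicates the predictors improve on the intercept-only model.

drop_dev <- fit$null.deviance - fit$deviance   # improvement in deviance
drop_df  <- fit$df.null - fit$df.residual      # number of added parameters
pchisq(drop_dev, df = drop_df, lower.tail = FALSE)  # likelihood-ratio p-value

anova(null, fit, test = "Chisq")               # the same comparison via anova()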

3) Can the Null Deviance be negative?

No. The deviance is defined through the difference of log-likelihoods between the fitted model and the saturated model, as shown below.

\(D = -2\,\mathrm{loglik}(\hat{\beta}) + 2\,\mathrm{loglik}(\text{mod})\), where mod is the saturated model.

Since the likelihood of the saturated model is exactly one for ungrouped binary data, its log-likelihood is zero and the deviance reduces to \(-2\,\mathrm{loglik}(\hat{\beta})\). A likelihood can never exceed one, so its log-likelihood is at most zero, and the deviance is therefore always greater than or equal to zero.

Zero occurs only when the fitted model reproduces the data perfectly.
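This is easy to verify in R for ungrouped binary data, where the saturated log-likelihood is zero (reusing fit from the sketch in question 1):

deviance(fit)                 # residual deviance
-2 * as.numeric(logLik(fit))  # identical, up to floating-point error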

4) Why is Logistic Regression, a classifier, called regression?

Logistic regression falls under the category of supervised learning; it measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic/sigmoid function.

Logistic Regression is a regression algorithm: its output is a continuous value on the sigmoid curve between 0 and 1, interpreted as a probability. The classifier arises only when a decision rule is applied to that probability; indeed, almost all classifiers return probabilities along with their 0/1 or multi-class decisions.

It is typically applied to classification problems, predicting a binary outcome (1/0, -1/1, True/False) from a set of independent variables. The output of interest in logistic regression is the probability p, which follows the S-shaped curve and is always between 0 and 1.

\(\ell = \mathrm{logit}(p) = \ln\left(\frac{p}{1-p}\right)\)

The value of p is continuous between 0 and 1 and can be compared against a chosen threshold value to make the classification decision, as in the sketch below.
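Continuing the simulated sketch from question 1 (fit and y are from there, and the 0.5 cutoff is just the conventional default, tunable for class imbalance), classification is the thresholding step applied on top of the regression output:

p_hat  <- predict(fit, type = "response")  # fitted probabilities p in (0, 1)
y_pred <- ifelse(p_hat > 0.5, 1, 0)        # threshold turns p into a class
table(observed = y, predicted = y_pred)    # simple confusion matrix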

5) Our group just completed a logistic run and generated all the betas. Now, our clients send us a revised dataset where all the labels are flipped from 0 to 1 and 1 to 0, and with NO other change.

What should we do now? Should we re-run and estimate the betas again? Explain.

In logistic regression we model the 1s, which means the 1 category is compared against the 0 category. I will demonstrate the necessity of re-running the model and the impact on the betas when the labels are flipped, using the data set provided for our first modeling problem.

setwd("C:/Users/Emahayz_Pro/Desktop")

clientdata <- read.csv(file="data622.csv", header=TRUE, sep=",")

str(clientdata)
## 'data.frame':    36 obs. of  3 variables:
##  $ X    : int  5 5 5 5 5 5 19 19 19 19 ...
##  $ Y    : chr  "      a" "      b" "      c" "      d" ...
##  $ label: chr  "      BLUE" "      BLACK" "      BLUE" "      BLACK" ...

We have label as the response variable with two categories, “BLUE” and “BLACK”. I will re-code “BLUE” = 1 and “BLACK” = 0 for my GLM model.

Logistic Regression with Initial Labels

I will add script to re-code the labels where “BLUE” is 1 and “BLACK” is 0.

# note: str() above shows leading whitespace in the label values, so trimws() may be needed for this match
clientdata$label <- ifelse(clientdata$label=="BLUE",1,0)

ClientModel <- glm(label ~., data = clientdata, family = binomial())

summary(ClientModel)
## 
## Call:
## glm(formula = label ~ ., family = binomial(), data = clientdata)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -3.971e-06  -3.971e-06  -3.971e-06  -3.971e-06  -3.971e-06  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.557e+01  1.478e+05       0        1
## X            2.281e-31  2.509e+03       0        1
## Y     b      3.341e-30  1.366e+05       0        1
## Y     c      3.550e-30  1.366e+05       0        1
## Y     d      9.806e-31  1.366e+05       0        1
## Y     e      1.965e-30  1.366e+05       0        1
## Y     f      1.495e-30  1.366e+05       0        1
## Y      a    -1.932e-13  2.566e+05       0        1
## Y      b    -2.643e-29  2.566e+05       0        1
## Y      c    -1.720e-31  2.566e+05       0        1
## Y      d    -1.596e-29  2.566e+05       0        1
## Y      e    -6.993e-30  2.566e+05       0        1
## Y      f    -1.156e-29  2.566e+05       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 35  degrees of freedom
## Residual deviance: 5.6772e-10  on 23  degrees of freedom
## AIC: 26
## 
## Number of Fisher Scoring iterations: 24

Flipping the Labels

I will modify the script to re-code the labels where “BLUE” is now 0 and “BLACK” is now 1.

clientdata$label <- ifelse(clientdata$label=="BLUE",0,1)

ClientModel1 <- glm(label ~., data = clientdata, family = binomial())

summary(ClientModel1)
## 
## Call:
## glm(formula = label ~ ., family = binomial(), data = clientdata)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## 3.971e-06  3.971e-06  3.971e-06  3.971e-06  3.971e-06  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  2.557e+01  1.478e+05       0        1
## X            1.104e-26  2.509e+03       0        1
## Y     b     -1.682e-25  1.366e+05       0        1
## Y     c     -3.095e-25  1.366e+05       0        1
## Y     d     -2.747e-26  1.366e+05       0        1
## Y     e     -1.350e-25  1.366e+05       0        1
## Y     f      0.000e+00  1.366e+05       0        1
## Y      a    -3.672e-09  2.566e+05       0        1
## Y      b    -2.173e-24  2.566e+05       0        1
## Y      c    -5.422e-26  2.566e+05       0        1
## Y      d     5.367e-25  2.566e+05       0        1
## Y      e    -4.320e-25  2.566e+05       0        1
## Y      f    -1.672e-25  2.566e+05       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 35  degrees of freedom
## Residual deviance: 5.6772e-10  on 23  degrees of freedom
## AIC: 26
## 
## Number of Fisher Scoring iterations: 24

Even with imperfect data that seems to produce an almost perfect model (a null deviance of exactly zero is itself a warning sign: it means the recoded response is constant, likely because the leading whitespace in the label values kept the exact match from ever being TRUE), we can see that the estimates of the betas changed. In particular, the intercept flipped sign from -2.557e+01 to +2.557e+01. This is the general pattern: because \(\mathrm{logit}(1-p) = -\mathrm{logit}(p)\), flipping the labels negates every coefficient, so the original betas cannot be reused unchanged.

Flipping the class labels in the test data will swap the true positive and false positive rates at each threshold, but I doubt that the accuracy of the logistic regression will change from simply switching the class labels, since the fitted probabilities are merely complemented. The sketch below demonstrates the sign flip on simulated data.
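A minimal sketch on simulated data (not the client data; names and coefficients are made up): since \(\mathrm{logit}(1-p) = -\mathrm{logit}(p)\), refitting after flipping the labels should return the negated betas.

set.seed(2)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.3 + 0.8 * x))  # simulated binary response
y_flipped <- 1 - y                           # labels flipped, nothing else changed

m1 <- glm(y ~ x,         family = binomial())
m2 <- glm(y_flipped ~ x, family = binomial())

coef(m1)
coef(m2)                        # the same betas with signs flipped
all.equal(coef(m1), -coef(m2))  # TRUE, up to fitting tolerance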