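The examples that follow use data.table (fread), ggplot2 (ggplot), caret (trainControl, train, varImp), and pROC (auc, plot.roc); dplyr and ROCR are also attached but not called directly below. A minimal setup chunk along these lines is assumed:

library(data.table) # fread()
library(dplyr)
library(ggplot2)    # ggplot()
library(caret)      # trainControl(), train(), varImp()
library(pROC)       # auc(), plot.roc()
library(ROCR)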

What is Logistic Regression?

Logistic regression is like a lot of other regression models in that it applies the maximum likelihood principle for parameter estimation. The most common approach is the Newton-Raphson method that we learned about in module 2. Logistic regression is performed on the log-odds scale (the logit), and the function that transforms log odds back into a probability is the logistic function. The model is used to predict a binary response, such as pass/fail, which is usually coded as a response variable taking the values 0 or 1.

If we call our response variable C, then we can let \(p_{x}\) be \(P(C=1 \mid X_{1} = x_{1}, X_{2} = x_{2}, \dots, X_{k} = x_{k})\). The logit model is then defined as:

\(log(\frac{p_x}{1-p_x}) = B_{0} + B_{1}*x_{1} + B_{2}*x_{2}+\cdots + B_{k}*x_{k}\)

or equivalently,

\(p_{x} = \displaystyle \frac{1}{1+e^{-(B_{0} + B_{1}*x_{1} + B_{2}*x_{2}+\cdots + B_{k}*x_{k})}}\)

This function is called a sigmoid. We can define a sigmoid as

sigmoid = function(z){
  1/(1+exp(-z))
}
# sigmoid(0) is 0.5
sigmoid(0)
## [1] 0.5
# the sigmoid of a large positive value approaches 1
sigmoid(100^100)
## [1] 1
# the sigmoid of a large negative value approaches 0
sigmoid(-100^100)
## [1] 0
# plot of the sigmoid
sig_x = seq(-10,10,by=.1)
sig_y = sigmoid(sig_x)
plot(sig_x,sig_y,type = "l")

When we write the likelihood as a function of the Betas, we get

\(L(B) = \prod_{j=1}^{N} p_{x_j}^{c_j}*{(1-p_{x_j})^{1-c_j}}\)

Then we take the log of that likelihood to get: \(l(B) = \sum_{j=1}^{N} (c_j*log(p_{x_j})+ (1-c_j)*log(1-p_{x_j}))\)

which is equivalent to \(\sum_{j=1}^{N}c_j*log(\frac{p_{x_j}}{1-p_{x_j}}) + \sum_{j=1}^{N}log(1-p_{x_j})\)

Using our original equations, we can then obtain

\(l(B) = \sum_{j=1}^{N}c_j*(B_0+B_1*x_{1_j} + B_2*x_{2_j} + \cdots + B_k*x_{k_j}) - \sum_{j=1}^{N}log(1+e^{B_0+B_1*x_{1_j} + B_2*x_{2_j} + \cdots + B_k*x_{k_j}})\)
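As a small sketch (the function name and arguments here are mine), this simplified form translates directly into R, where X holds a leading column of 1's and c is the 0/1 response vector:

loglik = function(B, X, c){
  eta = X %*% B                     # the linear predictor B_0 + B_1*x_1 + ... for each observation
  sum(c * eta - log(1 + exp(eta)))  # the simplified log-likelihood l(B) above
}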

We can then differentiate with respect to each Beta and get a system of equations to solve:

\(\frac{dlogL(B)}{dB_0} = \sum_{j=1}^{N}c_j - \sum_{j=1}^{N}\frac{1}{1+e^{-(B_{0} + B_{1}*x_{1_j} + B_{2}*x_{2_j}+\cdots + B_{k}*x_{k_j})}}\)

\(\cdots\)

\(\frac{dlogL(B)}{dB_k} = \sum_{j=1}^{N}c_j*x_{k_j} - \sum_{j=1}^{N} x_{k_j}*\frac{1}{1+e^{-(B_{0} + B_{1}*x_{1_j} + B_{2}*x_{2_j}+\cdots + B_{k}*x_{k_j})}}\)

How do we solve this?

Unfortunately, this system of equations is not solvable analytically. However, this is where we can take what we have learned in this class and apply it: this sort of problem is exactly what Newton-Raphson is good at.
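To make that concrete, here is a minimal Newton-Raphson sketch for this system (the function name is mine; X is assumed to carry a leading column of 1's and c is the 0/1 response):

newton_logistic = function(X, c, iter = 25){
  B = rep(0, ncol(X))                          # start every Beta at 0
  for(i in 1:iter){
    p = 1/(1 + exp(-(X %*% B)))                # current fitted probabilities p_x
    grad = t(X) %*% (c - p)                    # the system of first derivatives above
    hess = -t(X) %*% (X * as.vector(p*(1-p)))  # Hessian: -X'WX with weights p(1-p)
    B = B - solve(hess, grad)                  # Newton-Raphson update
  }
  B
}

This is essentially the iteratively reweighted least squares loop that glm() runs under the hood; in practice you would also add a convergence check on the change in B.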

Before we go on, let's take a look at how ordinary linear regression is optimized. As we learned in module 2, a linear model can be fit by minimizing its residuals. In the case of linear regression, the residuals can be summarized with the \(L_2\) norm, where

\(S_p(\theta) = \sum_{i=1}^{n}|y_i - f(x_i;\theta)|^{p}\)

is defined as the \(L_p\) norm and p = 2. This is also commonly called the “cost function”. We can use any of the methods listed in module 2 to minimize it, but let's look at an example with a gradient-based algorithm, specifically gradient descent.

x = runif(1000,-10,10)        # 1000 predictor values
y = x + rnorm(1000,0,2) + 1   # true relationship y = 1 + x plus Gaussian noise
plot(x,y)

We can see that the data are roughly linear with some noise added in. (Example partially from https://www.r-bloggers.com/linear-regression-by-gradient-descent/)

#Setting up the cost function for linear regression
cost = function(X,y,theta){
  sum( (X %*% theta - y)^2 / (2*length(y)))
}

alpha = .02
iter = 1000

cost_history = double(iter)              # cost at each iteration
theta_history = vector("list", iter)     # coefficient estimates at each iteration

#initialize coefficients
theta = matrix(c(0,0),nrow = 2)

#add a column of 1's for the intercept coefficient
X = cbind(1, matrix(x))

#gradient descent
for(i in 1:iter){
  error = (X %*% theta - y)          # n x 1 matrix of residuals y_hat - y
  delta = t(X) %*% error / length(y) # gradient of the cost with respect to each coefficient
  theta = theta - alpha * delta      # step downhill; alpha*delta shrinks as we approach the minimum
  cost_history[i] = cost(X,y,theta)
  theta_history[[i]] = theta
}

print(theta)
##          [,1]
## [1,] 0.962082
## [2,] 1.010703

We can see that this is very close to what R's glm function gives when fitting the same model:

linear_reg = glm(y~x)
print(linear_reg$coefficients)
## (Intercept)           x 
##    0.962082    1.010703

We can see that in problem set 2, for part d of problem 1, we did something extremely similar. The updating step here is the alpha of .02 times the average error for each coefficient. Eventually, this will lead to a minimum (global or local) if the iteration converges.

This can also be done with the optim function:

theta_2 = matrix(c(0,0),nrow = 2)
optim(par = theta_2,fn=cost, X=X, y =y, method = "BFGS")
## $par
##          [,1]
## [1,] 0.962082
## [2,] 1.010703
## 
## $value
## [1] 1.812126
## 
## $counts
## function gradient 
##       14        3 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Some brief explanations: method = “BFGS” uses a quasi-Newton method, like the ones we covered in class. In the results, par is the best estimate of the two parameters, value is the value of the cost function at par, and counts gives the number of calls to fn and gr respectively. A convergence code of 0 means successful completion; any other value indicates a problem.

Logistic Regression

The idea behind logistic regression is similar to linear regression, except that the response we are predicting is a probability (through the log odds). However, the cost function that we used for linear regression, mean squared error, does not perform as well here. This is mainly because our model is no longer linear in the parameters due to the sigmoid transformation we applied: squaring the prediction error, as in our \(L_2\) estimator, produces a non-convex function with many local minima.

The cost function that we will use instead is called “cross-entropy”. The advantage of this cost is that it can be split into two separate pieces: one for y=1 and one for y=0. Doesn't this sound convenient for a logistic regression? Anyway, the cost function is as follows:

\(J(\theta) = \frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\)

And the cost differs between y=1 and y=0: \(Cost(h_\theta(x^{(i)}),y^{(i)}) = -log(h_\theta(x))\) if y=1

\(Cost(h_\theta(x^{(i)}),y^{(i)}) = -log(1-h_\theta(x))\) if y=0

The advantage of taking the log in this cost function can be shown by plotting the two pieces against the predicted probability: the cost heavily penalizes predictions that are confident and wrong, blowing up as \(h_\theta(x)\) approaches the wrong label, while confident and correct predictions cost almost nothing.
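A quick sketch of those two pieces over \(h_\theta(x)\in(0,1)\), plotted with base R:

h = seq(0.001, 0.999, by = 0.001)
par(mfrow = c(1, 2))
plot(h, -log(h),     type = "l", xlab = "h_theta(x)", ylab = "cost", main = "y = 1")  # cost explodes as h -> 0
plot(h, -log(1 - h), type = "l", xlab = "h_theta(x)", ylab = "cost", main = "y = 0")  # cost explodes as h -> 1
par(mfrow = c(1, 1))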

We can combine these two cost functions through some clever mathematics:

\(J(\theta) = -\frac{1}{m} * \sum_{i=1}^{m}[y^{(i)}log(h_\theta(x^{(i)})) + (1-y^{(i)})log(1-h_\theta(x^{(i)}))]\)

Since \(y^{(i)}\) is either 0 or 1, only the appropriate piece of the cost is evaluated for each observation.

We will follow along with an example partially derived from https://www.r-bloggers.com/logistic-regression-regularized-with-optimization/

ex2data1 = fread("ex2data1.txt",col.names=c("Exam1","Exam2","Label")) # two exam scores plus a 0/1 admission label
logistic_X = cbind(1,ex2data1$Exam1,ex2data1$Exam2)                   # design matrix with an intercept column of 1's
ggplot(ex2data1,aes(x=Exam1, y=Exam2, col = factor(Label)))+
  geom_point()

It looks like there is a somewhat clear delineation in the data between the admitted (1) and not admitted (0) students.

Here we can input our cost function:

cost_logistic = function(X, y, theta){
  h_theta = sigmoid(X%*%theta)
  cost = 1/nrow(X)*sum(-y*log(h_theta)-(1-y)*log(1-h_theta))
  return(cost)
}
logistic_cost_history = double(iter)
logistic_theta_history = list(iter)

#The function below calculates the gradient of the cross-entropy cost.
#Shamelessly taken from the site mentioned above

my_gradient=function(theta, X, y){
    
    h_theta= sigmoid(X%*%theta)
    
    ## OPTION 1 -- looping over observations and coefficients
    
    #    gradient=rep(0,ncol(X))
    #      for(j in 1:ncol(X)){
    #       for(i in 1:nrow(X)){
    #    gradient[j]=gradient[j]+1/nrow(X)*(sigmoid(X[i,]%*%theta)-y[i])*X[i,j]
    #                   }
    #               }
    
    ## OPTION 2 -- more succinct, vectorized form of the same gradient
    gradient = 1/nrow(X)*(sigmoid(t(theta)%*%t(X))-t(y))%*%X
    return(gradient)                                                                           
}


initial_theta =matrix(rep(0,ncol(logistic_X)))
my_gradient(initial_theta,logistic_X,ex2data1$Label)
##      [,1]      [,2]      [,3]
## [1,] -0.1 -12.00922 -11.26284
#We can see the initial values of the gradient at our initial theta

optim(par = initial_theta,X=logistic_X,y=ex2data1$Label,fn=cost_logistic, gr = my_gradient)
## $par
##             [,1]
## [1,] -25.1647048
## [2,]   0.2062524
## [3,]   0.2015046
## 
## $value
## [1] 0.2034977
## 
## $counts
## function gradient 
##      156       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
optim(par = initial_theta,X=logistic_X,y=ex2data1$Label,fn=cost_logistic, method = "Nelder-Mead")
## $par
##             [,1]
## [1,] -25.1647048
## [2,]   0.2062524
## [3,]   0.2015046
## 
## $value
## [1] 0.2034977
## 
## $counts
## function gradient 
##      156       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
optim(par = initial_theta,X=logistic_X,y=ex2data1$Label,fn=cost_logistic, method = "BFGS")
## $par
##             [,1]
## [1,] -22.4364555
## [2,]   0.1841454
## [3,]   0.1795486
## 
## $value
## [1] 0.2047206
## 
## $counts
## function gradient 
##      194      100 
## 
## $convergence
## [1] 1
## 
## $message
## NULL
glm(Label~Exam1+Exam2, data=ex2data1, family = 'binomial')
## 
## Call:  glm(formula = Label ~ Exam1 + Exam2, family = "binomial", data = ex2data1)
## 
## Coefficients:
## (Intercept)        Exam1        Exam2  
##    -25.1613       0.2062       0.2015  
## 
## Degrees of Freedom: 99 Total (i.e. Null);  97 Residual
## Null Deviance:       134.6 
## Residual Deviance: 40.7  AIC: 46.7

We can see that the optim call with our user-specified gradient performs the same as the one with the default method “Nelder-Mead”, which is expected since Nelder-Mead ignores the gradient function. Both match the coefficients from glm, R's built-in function for fitting generalized linear models. However, the optim call using “BFGS” is not quite as accurate here; its convergence code of 1 shows that it stopped at the iteration limit.
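A cheap sanity check (a sketch, with my own glm_fit name) is to evaluate our analytic gradient at glm's estimates; at the maximum likelihood solution it should be essentially zero:

glm_fit = glm(Label ~ Exam1 + Exam2, data = ex2data1, family = 'binomial')
my_gradient(matrix(coef(glm_fit)), logistic_X, ex2data1$Label)  # should be approximately c(0, 0, 0)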

Since we have shown how the glm() function in R works, we will use this function to create our logistic regressions from now on.

Practice

Let us practice some logistic regression now.

pulsar = read.csv("pulsar_stars.csv")                      # pulsar candidate measurements with a 0/1 target_class
pulsar$target_class_fct = as.factor(pulsar$target_class)   # recode the target as a factor for classification
str(pulsar)
## 'data.frame':    17898 obs. of  10 variables:
##  $ Mean.of.the.integrated.profile              : num  140.6 102.5 103 136.8 88.7 ...
##  $ Standard.deviation.of.the.integrated.profile: num  55.7 58.9 39.3 57.2 40.7 ...
##  $ Excess.kurtosis.of.the.integrated.profile   : num  -0.2346 0.4653 0.3233 -0.0684 0.6009 ...
##  $ Skewness.of.the.integrated.profile          : num  -0.7 -0.515 1.051 -0.636 1.123 ...
##  $ Mean.of.the.DM.SNR.curve                    : num  3.2 1.68 3.12 3.64 1.18 ...
##  $ Standard.deviation.of.the.DM.SNR.curve      : num  19.1 14.9 21.7 21 11.5 ...
##  $ Excess.kurtosis.of.the.DM.SNR.curve         : num  7.98 10.58 7.74 6.9 14.27 ...
##  $ Skewness.of.the.DM.SNR.curve                : num  74.2 127.4 63.2 53.6 252.6 ...
##  $ target_class                                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ target_class_fct                            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
pulsar_logistic = glm(target_class_fct~Mean.of.the.integrated.profile + Standard.deviation.of.the.integrated.profile + Excess.kurtosis.of.the.integrated.profile + Skewness.of.the.integrated.profile + Mean.of.the.DM.SNR.curve + Standard.deviation.of.the.DM.SNR.curve + Excess.kurtosis.of.the.DM.SNR.curve +Skewness.of.the.DM.SNR.curve , data = pulsar,  family = 'binomial')
pulsar_logistic
## 
## Call:  glm(formula = target_class_fct ~ Mean.of.the.integrated.profile + 
##     Standard.deviation.of.the.integrated.profile + Excess.kurtosis.of.the.integrated.profile + 
##     Skewness.of.the.integrated.profile + Mean.of.the.DM.SNR.curve + 
##     Standard.deviation.of.the.DM.SNR.curve + Excess.kurtosis.of.the.DM.SNR.curve + 
##     Skewness.of.the.DM.SNR.curve, family = "binomial", data = pulsar)
## 
## Coefficients:
##                                  (Intercept)  
##                                     -9.01995  
##               Mean.of.the.integrated.profile  
##                                      0.03026  
## Standard.deviation.of.the.integrated.profile  
##                                     -0.03543  
##    Excess.kurtosis.of.the.integrated.profile  
##                                      6.57706  
##           Skewness.of.the.integrated.profile  
##                                     -0.61624  
##                     Mean.of.the.DM.SNR.curve  
##                                     -0.02858  
##       Standard.deviation.of.the.DM.SNR.curve  
##                                      0.05317  
##          Excess.kurtosis.of.the.DM.SNR.curve  
##                                      0.04775  
##                 Skewness.of.the.DM.SNR.curve  
##                                     -0.00475  
## 
## Degrees of Freedom: 17897 Total (i.e. Null);  17889 Residual
## Null Deviance:       10960 
## Residual Deviance: 2616  AIC: 2634

In this simple example, we just took the entire dataset and ran a simple glm on it. This gave us a (rather uninformative) log-odds coefficient for each variable. We can get a slightly more informative answer by exponentiating the coefficients into odds ratios.

exp(pulsar_logistic$coefficients)
##                                  (Intercept) 
##                                 1.209717e-04 
##               Mean.of.the.integrated.profile 
##                                 1.030722e+00 
## Standard.deviation.of.the.integrated.profile 
##                                 9.651900e-01 
##    Excess.kurtosis.of.the.integrated.profile 
##                                 7.184261e+02 
##           Skewness.of.the.integrated.profile 
##                                 5.399691e-01 
##                     Mean.of.the.DM.SNR.curve 
##                                 9.718201e-01 
##       Standard.deviation.of.the.DM.SNR.curve 
##                                 1.054604e+00 
##          Excess.kurtosis.of.the.DM.SNR.curve 
##                                 1.048907e+00 
##                 Skewness.of.the.DM.SNR.curve 
##                                 9.952616e-01

We can see that the odds of the star being a pulsar increase dramatically with the excess kurtosis of the integrated profile (an odds ratio of roughly 718 per unit), while the remaining variables have much more modest effects.
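Since that odds ratio is per unit of excess kurtosis, a per-standard-deviation version (a small sketch of my own) can be easier to digest:

kurt = pulsar$Excess.kurtosis.of.the.integrated.profile
exp(coef(pulsar_logistic)["Excess.kurtosis.of.the.integrated.profile"] * sd(kurt))  # odds ratio for a 1 SD increase in kurtosis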

But we're using the whole dataset here. How will we know how well it can predict values not in the dataset? As we learned in module 8, we can use cross-validation to measure the accuracy of the model on data it hasn't seen. As a brief reminder of how cross-validation works: the data are split into k folds, the model is fit on the other k-1 folds, its accuracy is measured on the held-out fold, and this is repeated so that every fold serves once as the validation set.
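To make the bookkeeping concrete, here is a hand-rolled 10-fold version for this model (a sketch with my own object names; it uses the numeric target_class so predictions can be compared directly):

set.seed(1)                                           # for reproducible fold assignments
folds = sample(rep(1:10, length.out = nrow(pulsar)))  # randomly assign every row to one of 10 folds
acc = numeric(10)
for(k in 1:10){
  fit_k  = glm(target_class ~ . - target_class_fct, data = pulsar[folds != k, ], family = "binomial")
  prob_k = predict(fit_k, newdata = pulsar[folds == k, ], type = "response")
  acc[k] = mean((prob_k > 0.5) == pulsar$target_class[folds == k])  # accuracy on the held-out fold
}
mean(acc)                                             # cross-validated accuracy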

The R package caret provides a convenient way to do the same cross-validation:

train_control = trainControl(method = "cv",number = 10,savePredictions = TRUE)
pulsar_lg_cv = train(target_class_fct~Mean.of.the.integrated.profile + Standard.deviation.of.the.integrated.profile + Excess.kurtosis.of.the.integrated.profile + Skewness.of.the.integrated.profile + Mean.of.the.DM.SNR.curve + Standard.deviation.of.the.DM.SNR.curve + Excess.kurtosis.of.the.DM.SNR.curve +Skewness.of.the.DM.SNR.curve , data = pulsar, trControl = train_control, method= 'glm',family="binomial")
summary(pulsar_lg_cv)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3813  -0.1644  -0.1000  -0.0565   3.6178  
## 
## Coefficients:
##                                               Estimate Std. Error z value
## (Intercept)                                  -9.019954   0.977017  -9.232
## Mean.of.the.integrated.profile                0.030260   0.005925   5.107
## Standard.deviation.of.the.integrated.profile -0.035430   0.010386  -3.411
## Excess.kurtosis.of.the.integrated.profile     6.577063   0.300110  21.916
## Skewness.of.the.integrated.profile           -0.616243   0.039709 -15.519
## Mean.of.the.DM.SNR.curve                     -0.028585   0.003268  -8.747
## Standard.deviation.of.the.DM.SNR.curve        0.053166   0.007364   7.220
## Excess.kurtosis.of.the.DM.SNR.curve           0.047749   0.085758   0.557
## Skewness.of.the.DM.SNR.curve                 -0.004750   0.003051  -1.557
##                                              Pr(>|z|)    
## (Intercept)                                   < 2e-16 ***
## Mean.of.the.integrated.profile               3.27e-07 ***
## Standard.deviation.of.the.integrated.profile 0.000647 ***
## Excess.kurtosis.of.the.integrated.profile     < 2e-16 ***
## Skewness.of.the.integrated.profile            < 2e-16 ***
## Mean.of.the.DM.SNR.curve                      < 2e-16 ***
## Standard.deviation.of.the.DM.SNR.curve       5.21e-13 ***
## Excess.kurtosis.of.the.DM.SNR.curve          0.577676    
## Skewness.of.the.DM.SNR.curve                 0.119474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10959.5  on 17897  degrees of freedom
## Residual deviance:  2615.8  on 17889  degrees of freedom
## AIC: 2633.8
## 
## Number of Fisher Scoring iterations: 8
pulsar_lg_cv$results
##   parameter  Accuracy     Kappa  AccuracySD   KappaSD
## 1      none 0.9791597 0.8672609 0.002660303 0.0181324

We can see that we get the same coefficients as before. However, now we can also see the accuracy of the regression on the cross-validated samples: about 97.9%, so the model predicts data it has not seen quite well.

We can also run the varImp function in the caret package in order to get the variable importance of each feature:

varImp(pulsar_lg_cv)
## glm variable importance
## 
##                                              Overall
## Excess.kurtosis.of.the.integrated.profile    100.000
## Skewness.of.the.integrated.profile            70.052
## Mean.of.the.DM.SNR.curve                      38.344
## Standard.deviation.of.the.DM.SNR.curve        31.195
## Mean.of.the.integrated.profile                21.305
## Standard.deviation.of.the.integrated.profile  13.364
## Skewness.of.the.DM.SNR.curve                   4.683
## Excess.kurtosis.of.the.DM.SNR.curve            0.000

It seems we were right that the excess kurtosis of the integrated profile plays the most important role, and it also looks like we could remove the excess kurtosis of the DM.SNR curve without much loss.
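A quick way to check that last claim (a sketch, with my own object name) is to refit without that term and compare AIC:

pulsar_reduced = update(pulsar_logistic, . ~ . - Excess.kurtosis.of.the.DM.SNR.curve)
AIC(pulsar_logistic, pulsar_reduced)  # a nearly identical AIC would support dropping the term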

The receiver operating characteristic (ROC) curve is a measure of classifier performance. It plots the proportion of positive data points that are correctly classified as positive (the true positive rate) against the proportion of negative data points that are incorrectly classified as positive (the false positive rate) as the classification threshold varies, showing the tradeoff between correct and incorrect positive predictions. Ultimately, we will be concerned with the AUC, the area under that curve. We can compute the AUC and draw the ROC curve with the pROC library.
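First, for a single threshold (0.5, say), the two rates the ROC curve trades off can be read from a confusion matrix; a small sketch using the fitted values from the cross-validated model:

pred_class = as.integer(pulsar_lg_cv$finalModel$fitted.values > 0.5)  # classify with a 0.5 cutoff
table(predicted = pred_class, actual = pulsar$target_class)           # sensitivity and specificity come from this table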

auc(pulsar$target_class_fct, pulsar_lg_cv$finalModel$fitted.values)
## Area under the curve: 0.9764
plot.roc(pulsar$target_class_fct,
         pulsar_lg_cv$finalModel$fitted.values)

We can see that this model allows us to select a threshold with both a large sensitivity and a large specificity, hence the large AUC and the high accuracy.

Testing on some Pokemon data

This is just a quick application of what we have gone over on another dataset (and also to satisfy my own curiosity).

pokemon = read.csv("Pokemon.csv")
str(pokemon)
## 'data.frame':    800 obs. of  13 variables:
##  $ X.        : int  1 2 3 3 4 5 6 6 6 7 ...
##  $ Name      : Factor w/ 800 levels "Abomasnow","AbomasnowMega Abomasnow",..: 81 330 746 747 103 104 100 101 102 666 ...
##  $ Type.1    : Factor w/ 18 levels "Bug","Dark","Dragon",..: 10 10 10 10 7 7 7 7 7 18 ...
##  $ Type.2    : Factor w/ 19 levels "","Bug","Dark",..: 15 15 15 15 1 1 9 4 9 1 ...
##  $ Total     : int  318 405 525 625 309 405 534 634 634 314 ...
##  $ HP        : int  45 60 80 80 39 58 78 78 78 44 ...
##  $ Attack    : int  49 62 82 100 52 64 84 130 104 48 ...
##  $ Defense   : int  49 63 83 123 43 58 78 111 78 65 ...
##  $ Sp..Atk   : int  65 80 100 122 60 80 109 130 159 50 ...
##  $ Sp..Def   : int  65 80 100 120 50 65 85 85 115 64 ...
##  $ Speed     : int  45 60 80 80 65 80 100 100 100 43 ...
##  $ Generation: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Legendary : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
pokemon_lg_cv = train(Legendary~ Type.1 + Type.2 + HP+Attack+Defense+Sp..Atk+Sp..Def+Speed+Generation, data= pokemon, trControl = train_control, method = 'glm', family='binomial')
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

(The warning above is repeated for each model fit during cross-validation. It usually signals quasi-complete separation, meaning some predictor combinations predict Legendary status almost perfectly, which is consistent with the enormous standard errors on several of the Type coefficients below.)
summary(pokemon_lg_cv)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.87606  -0.03379  -0.00095  -0.00001   2.10325  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -5.119e+01  2.256e+03  -0.023 0.981899    
## Type.1Dark      1.665e+01  2.256e+03   0.007 0.994111    
## Type.1Dragon    1.600e+01  2.256e+03   0.007 0.994343    
## Type.1Electric  1.836e+01  2.256e+03   0.008 0.993509    
## Type.1Fairy     1.355e+01  2.256e+03   0.006 0.995207    
## Type.1Fighting -8.192e-01  4.427e+03   0.000 0.999852    
## Type.1Fire      1.754e+01  2.256e+03   0.008 0.993798    
## Type.1Flying    1.915e+01  2.256e+03   0.008 0.993227    
## Type.1Ghost     1.730e+01  2.256e+03   0.008 0.993881    
## Type.1Grass     1.727e+01  2.256e+03   0.008 0.993893    
## Type.1Ground    1.875e+01  2.256e+03   0.008 0.993371    
## Type.1Ice       1.805e+01  2.256e+03   0.008 0.993617    
## Type.1Normal    1.335e+01  2.256e+03   0.006 0.995278    
## Type.1Poison    2.645e+00  4.745e+03   0.001 0.999555    
## Type.1Psychic   1.798e+01  2.256e+03   0.008 0.993641    
## Type.1Rock      1.831e+01  2.256e+03   0.008 0.993524    
## Type.1Steel     1.752e+01  2.256e+03   0.008 0.993806    
## Type.1Water     1.588e+01  2.256e+03   0.007 0.994386    
## Type.2Bug      -1.437e+01  1.415e+04  -0.001 0.999190    
## Type.2Dark     -5.166e+00  1.963e+00  -2.632 0.008485 ** 
## Type.2Dragon   -3.131e+00  1.273e+00  -2.459 0.013940 *  
## Type.2Electric -1.212e-02  4.165e+00  -0.003 0.997679    
## Type.2Fairy    -2.509e+00  1.730e+00  -1.450 0.147000    
## Type.2Fighting -1.120e+00  9.696e-01  -1.155 0.247944    
## Type.2Fire     -1.064e+00  1.744e+00  -0.610 0.541966    
## Type.2Flying   -1.348e-01  8.346e-01  -0.162 0.871658    
## Type.2Ghost    -8.753e-01  2.244e+00  -0.390 0.696517    
## Type.2Grass    -1.842e+01  4.211e+03  -0.004 0.996509    
## Type.2Ground   -2.601e+00  1.818e+00  -1.430 0.152592    
## Type.2Ice      -1.615e+00  1.817e+00  -0.889 0.374196    
## Type.2Normal   -1.872e+01  1.265e+04  -0.001 0.998820    
## Type.2Poison   -1.797e+01  3.365e+03  -0.005 0.995740    
## Type.2Psychic  -6.486e-01  1.118e+00  -0.580 0.561844    
## Type.2Rock     -1.579e+01  5.994e+03  -0.003 0.997899    
## Type.2Steel    -4.222e-01  1.602e+00  -0.264 0.792145    
## Type.2Water    -2.931e-01  1.815e+00  -0.161 0.871703    
## HP              6.695e-02  1.640e-02   4.082 4.46e-05 ***
## Attack          3.030e-02  1.026e-02   2.954 0.003137 ** 
## Defense         4.825e-02  1.255e-02   3.845 0.000121 ***
## Sp..Atk         5.193e-02  1.275e-02   4.072 4.65e-05 ***
## Sp..Def         6.222e-02  1.351e-02   4.607 4.09e-06 ***
## Speed           6.571e-02  1.452e-02   4.524 6.06e-06 ***
## Generation      6.437e-01  2.058e-01   3.127 0.001764 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 450.90  on 799  degrees of freedom
## Residual deviance: 118.25  on 757  degrees of freedom
## AIC: 204.25
## 
## Number of Fisher Scoring iterations: 20
varImp(pokemon_lg_cv)
## glm variable importance
## 
##   only 20 most important variables shown (out of 42)
## 
##                 Overall
## Sp..Def        100.0000
## Speed           98.2075
## HP              88.6131
## Sp..Atk         88.3995
## Defense         83.4610
## Generation      67.8835
## Attack          64.1187
## Type.2Dark      57.1328
## Type.2Dragon    53.3708
## Type.2Fairy     31.4764
## Type.2Ground    31.0472
## Type.2Fighting  25.0759
## Type.2Ice       19.2862
## Type.2Fire      13.2342
## Type.2Psychic   12.5886
## Type.2Ghost      8.4625
## Type.2Steel      5.7165
## Type.2Flying     3.5029
## Type.2Water      3.5016
## Type.1Flying     0.1803

References

https://www.r-bloggers.com/logistic-regression-regularized-with-optimization/

https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html#cost-function

https://www.r-bloggers.com/linear-regression-by-gradient-descent/

https://rpubs.com/fhlgood/optlr

https://www.internalpointers.com/post/cost-function-logistic-regression

https://www.r-bloggers.com/evaluating-logistic-regression-models/

https://stackoverflow.com/questions/31138751/roc-curve-from-training-data-in-caret