Logistic regression is like many other regression models in that it applies the maximum likelihood principle for parameter estimation. The most common approach is the Newton-Raphson method that we learned about in module 2. Logistic regression is performed on the log-odds scale (the logit), and the function that transforms the log-odds back into a probability is the logistic function. This function is used to predict a binary response, such as pass/fail, which is usually coded as a response variable taking values 0 or 1.
If we call our response variable C, then we can let \(p_{x}\) be \(P(C=1 \mid X_{1} = x_{1}, X_{2} = x_{2}, \dots, X_{k} = x_{k})\). The logit model is then defined as:
\(log(\frac{p_x}{1-p_x}) = B_{0} + B_{1}*x_{1} + B_{2}*x_{2}+\cdots + B_{k}*x_{k}\)
or equivalently,
\(p_{x} = \displaystyle \frac{1}{1+e^{-(B_{0} + B_{1}*x_{1} + B_{2}*x_{2}+\cdots + B_{k}*x_{k})}}\)
This function is called a sigmoid. We can define a sigmoid as
sigmoid = function(z){
  1/(1+exp(-z))
}
#sigmoid(0) is 0.5
sigmoid(0)
## [1] 0.5
#the sigmoid of large positive values approaches 1
sigmoid(100^100)
## [1] 1
#the sigmoid of large negative values approaches 0
sigmoid(-100^100)
## [1] 0
#plot of sigmoid
sig_x = seq(-10,10,by=.1)
sig_y = sigmoid(sig_x)
plot(sig_x,sig_y,type = "l")
When we take the likelihood function with respect to the Betas, we get
\(L(B) = \prod_{j=1}^{N} p_{x_j}^{c_j}*{(1-p_{x_j})^{1-c_j}}\)
Then we take the log of that likelihood to get: \(l(B) = \sum_{j=1}^{N} (c_j*log(p_{x_j})+ (1-c_j)*log(1-p_{x_j}))\)
which is equivalent to \(\sum_{j=1}^{N}c_j*log(\frac{p_{x_j}}{1-p_{x_j}}) + \sum_{j=1}^{N}log(1-p_{x_j})\)
Using our original equations, we can then obtain
\(l(B) = \sum_{j=1}^{N}c_j*(B_0+B_1*x_{1_j} + B_2*x_{2_j} + \cdots + B_k*x_{k_j}) - \sum_{j=1}^{N}log(1+e^{B_0+B_1*x_{1_j} + B_2*x_{2_j} + \cdots + B_k*x_{k_j}})\)
We can then differentiate according to each Beta and get a system of equations to solve:
\(\frac{dlogL(B)}{dB_0} = \sum_{j=1}^{N}c_j - \sum_{j=1}^{N}\frac{1}{1+e^{-(B_{0} + B_{1}*x_{1_j} + B_{2}*x_{2_j}+\cdots + B_{k}*x_{k_j})}}\)
\(\cdots\)
\(\frac{dlogL(B)}{dB_k} = \sum_{j=1}^{N}c_j*x_{k_j} - \sum_{j=1}^{N} x_{k_j}*\frac{1}{1+e^{-(B_{0} + B_{1}*x_{1_j} + B_{2}*x_{2_j}+\cdots + B_{k}*x_{k_j})}}\)
Unfortunately, this system of equations is not solvable analytically. However, this is where we can apply what we have learned in this class: this sort of problem is exactly what Newton-Raphson is good at solving.
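As a preview, here is a minimal sketch of the Newton-Raphson update for exactly this system, on simulated data (the data, variable names, and convergence tolerance are illustrative choices, not part of the example used later). Each iteration solves the score equations linearized at the current estimate, using the gradient and the Hessian of the log-likelihood.
#Newton-Raphson for logistic regression on simulated data (illustrative sketch)
set.seed(1)
nr_X = cbind(1, rnorm(200)) #design matrix with an intercept column
nr_c = rbinom(200, 1, plogis(1 + 2*nr_X[,2])) #binary response
B = rep(0, ncol(nr_X)) #starting values for the coefficients
for(step in 1:25){
  p = 1/(1 + exp(-(nr_X %*% B))) #current fitted probabilities
  W = as.vector(p*(1 - p)) #the weights p*(1-p) that sit on the diagonal
  score = t(nr_X) %*% (nr_c - p) #gradient of the log-likelihood
  hess = t(nr_X) %*% (nr_X * W) #(negative) Hessian of the log-likelihood
  step_size = solve(hess, score) #Newton-Raphson step
  B = B + step_size
  if(max(abs(step_size)) < 1e-8) break #stop once the step is negligible
}
print(B)
Fitting the same simulated data with glm(nr_c ~ nr_X[,2], family = "binomial") should give essentially the same coefficients.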
Before we go on, let’s take a look at how ordinary linear regression is optimized. As we learned in module 2, a linear model can be fit by minimizing its residuals. In the case of linear regression, the residuals can be expressed with the \(L_2\) norm, where
\(S_p(\theta) = \sum_{i=1}^{n}|y_i - f(\theta)|^{p}\)
is defined as the \(L_p\) norm with p = 2. This is also commonly called the “cost function”. We can use any of the methods listed in module 2 to minimize it, but let’s look at an example using gradient descent.
x = runif(1000,-10,10)
y = x + rnorm(1000,0,2)+1
plot(x,y)
We can see that the relationship is roughly linear with some noise added in. (Example partially adapted from https://www.r-bloggers.com/linear-regression-by-gradient-descent/)
#Setting up the cost function for linear regression
cost = function(X, y, theta){
  sum((X %*% theta - y)^2 / (2*length(y)))
}
alpha = .02
iter = 1000
cost_history = double(iter)
theta_history = list(iter)
#initialize coefficients
theta = matrix(c(0,0), nrow = 2)
#add a column of 1's for the intercept coefficient
X = cbind(1, matrix(x))
#gradient descent
for(i in 1:iter){
  error = (X %*% theta - y) #an n x 1 matrix of residuals (predictions minus observations)
  delta = t(X) %*% error / length(y) #the gradient: average error weighted by each coefficient's predictor
  theta = theta - alpha*delta #step against the gradient; alpha*delta -> 0 as we converge
  cost_history[i] = cost(X, y, theta)
  theta_history[[i]] = theta
}
print(theta)
print(theta)
## [,1]
## [1,] 0.962082
## [2,] 1.010703
We can see that this is very close to the coefficients produced by R’s glm function:
linear_reg = glm(y~x)
print(linear_reg$coefficients)
## (Intercept) x
## 0.962082 1.010703
In problem set 2, part d of problem 1, we did something extremely similar. The update in this case is alpha = .02 times the average error for each coefficient. Eventually, this will lead to a minimum (global or local) if the algorithm converges.
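One way to see that convergence is to plot the cost history we recorded in the loop above; a quick sketch:
#plot the recorded cost per iteration; the curve should flatten as theta approaches the minimizer
plot(cost_history, type = "l", xlab = "iteration", ylab = "cost")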
This can also be done with the optim function:
theta_2 = matrix(c(0,0),nrow = 2)
optim(par = theta_2,fn=cost, X=X, y =y, method = "BFGS")
## $par
## [,1]
## [1,] 0.962082
## [2,] 1.010703
##
## $value
## [1] 1.812126
##
## $counts
## function gradient
## 14 3
##
## $convergence
## [1] 0
##
## $message
## NULL
Some brief explanations: method = “BFGS” uses a quasi-Newton method, like the ones we covered in class. In the results, par is the best estimate of the two parameters, value is the value of the function at par, counts gives the number of calls to fn and gr respectively, and convergence of 0 means successful completion (anything else indicates a problem).
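Since optim also accepts an analytic gradient through the gr argument, we could supply one for the least-squares cost as well; a sketch (cost_gradient is a name introduced here for illustration, not part of the original code):
#analytic gradient of the least-squares cost, passed to optim via gr
cost_gradient = function(theta, X, y){
  t(X) %*% (X %*% theta - y) / length(y)
}
optim(par = theta_2, fn = cost, gr = cost_gradient, X = X, y = y, method = "BFGS")
With an analytic gradient, BFGS no longer has to approximate the gradient numerically.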
The idea behind logistic regression is similar to linear regression, except that the response we are predicting is a probability (through the log-odds). However, the mean-squared-error cost function that we used for linear regression does not perform as well here. This is mainly because our equation is no longer linear due to the transformation we applied: squaring the prediction error, as in our \(L_2\) estimator, produces a non-convex function with many local minima.
The cost function that we will use instead is called “Cross-Entropy”. The advantage of this method is that it can be divided into two separate cost functions: one for y=1 and one for y=0. Doesn’t this sound convenient for logistic regression? Anyway, the cost function is as follows:
\(J(\theta) = \frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\)
And the cost differs for y=1 and y=0: \(Cost(h_\theta(x^{(i)}),y^{(i)}) = -log(h_\theta(x^{(i)}))\) if y=1
\(Cost(h_\theta(x^{(i)}),y^{(i)}) = -log(1-h_\theta(x^{(i)}))\) if y=0
The advantage of taking the log in this cost function can be shown by plotting the two branches:
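A quick sketch of those two curves over the predicted probability \(h_\theta(x)\):
#cost for y=1 is -log(h); cost for y=0 is -log(1-h)
h = seq(0.001, 0.999, by = 0.001)
plot(h, -log(h), type = "l", xlab = "h_theta(x)", ylab = "cost") #y = 1 branch
lines(h, -log(1 - h), lty = 2) #y = 0 branch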
As we can see, these curves show that this cost function heavily penalizes predictions that are confident but wrong.
We can combine these two cost functions through some clever mathematics:
\(J(\theta) = -\frac{1}{m} * \sum_{i=1}^{m}[y^{(i)}log(h_\theta(x^{(i)})) + (1-y^{(i)})log(1-h_\theta(x^{(i)}))]\)
Since \(y^{(i)}\) is either 0 or 1, only the appropriate branch of the cost is evaluated for each observation.
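A quick numerical check, using a hypothetical predicted probability of 0.9:
h = 0.9 #a hypothetical predicted probability
-(1*log(h) + (1 - 1)*log(1 - h)) #y = 1: the second term vanishes, leaving -log(h), a small cost
-(0*log(h) + (1 - 0)*log(1 - h)) #y = 0: the first term vanishes, leaving -log(1-h), a large cost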
We will follow along with an example partially derived from https://www.r-bloggers.com/logistic-regression-regularized-with-optimization/
ex2data1 = fread("ex2data1.txt",col.names=c("Exam1","Exam2","Label"))
logistic_X = cbind(1,ex2data1$Exam1,ex2data1$Exam2)
ggplot(ex2data1,aes(x=Exam1, y=Exam2, col = factor(Label)))+
geom_point()
It looks like there is a somewhat clear delineation in the data between the admitted (1) and not admitted (0) students.
Here we can input our cost function:
cost_logistic = function(X, y, theta){
  h_theta = sigmoid(X %*% theta)
  cost = 1/nrow(X)*sum(-y*log(h_theta) - (1-y)*log(1-h_theta))
  return(cost)
}
logistic_cost_history = double(iter)
logistic_theta_history = list(iter)
#The function below calculates the gradient of the cross-entropy cost.
#Shamelessly taken from the site mentioned above
my_gradient = function(theta, X, y){
  h_theta = sigmoid(X %*% theta)
  ## OPTION 1 -- looping
  # gradient = rep(0, ncol(X))
  # for(j in 1:ncol(X)){
  #   for(i in 1:nrow(X)){
  #     gradient[j] = gradient[j] + 1/nrow(X)*(sigmoid(X[i,] %*% theta) - y[i])*X[i,j]
  #   }
  # }
  ## OPTION 2 -- more succinct, vectorized
  gradient = 1/nrow(X)*(sigmoid(t(theta) %*% t(X)) - t(y)) %*% X
  return(gradient)
}
initial_theta =matrix(rep(0,ncol(logistic_X)))
my_gradient(initial_theta,logistic_X,ex2data1$Label)
## [,1] [,2] [,3]
## [1,] -0.1 -12.00922 -11.26284
#We can see the initial values of the gradient at our initial theta
optim(par = initial_theta,X=logistic_X,y=ex2data1$Label,fn=cost_logistic, gr = my_gradient)
## $par
## [,1]
## [1,] -25.1647048
## [2,] 0.2062524
## [3,] 0.2015046
##
## $value
## [1] 0.2034977
##
## $counts
## function gradient
## 156 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
optim(par = initial_theta,X=logistic_X,y=ex2data1$Label,fn=cost_logistic, method = "Nelder-Mead")
## $par
## [,1]
## [1,] -25.1647048
## [2,] 0.2062524
## [3,] 0.2015046
##
## $value
## [1] 0.2034977
##
## $counts
## function gradient
## 156 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
optim(par = initial_theta,X=logistic_X,y=ex2data1$Label,fn=cost_logistic, method = "BFGS")
## $par
## [,1]
## [1,] -22.4364555
## [2,] 0.1841454
## [3,] 0.1795486
##
## $value
## [1] 0.2047206
##
## $counts
## function gradient
## 194 100
##
## $convergence
## [1] 1
##
## $message
## NULL
glm(Label~Exam1+Exam2, data=ex2data1, family = 'binomial')
##
## Call: glm(formula = Label ~ Exam1 + Exam2, family = "binomial", data = ex2data1)
##
## Coefficients:
## (Intercept) Exam1 Exam2
## -25.1613 0.2062 0.2015
##
## Degrees of Freedom: 99 Total (i.e. Null); 97 Residual
## Null Deviance: 134.6
## Residual Deviance: 40.7 AIC: 46.7
We can see that the optim call with our user-specified gradient function performed the same as the one with the default “Nelder-Mead” method, and both match the coefficients from glm, R’s built-in function for fitting generalized linear models. However, the optim run with method “BFGS” did not reach as accurate a solution here (note its convergence code of 1).
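As a quick sanity check, we can turn one of these optim solutions into class predictions and look at the training accuracy; a sketch (fit_nm is a name introduced here for the stored result):
#re-run the Nelder-Mead fit, keep the result, and classify at a 0.5 threshold
fit_nm = optim(par = initial_theta, X = logistic_X, y = ex2data1$Label, fn = cost_logistic)
pred_prob = sigmoid(logistic_X %*% fit_nm$par)
mean((pred_prob > 0.5) == ex2data1$Label) #proportion of training points classified correctly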
Since we have shown how the glm() function in R works, we will use it to create our logistic regressions from now on.
Let us practice some logistic regression now.
pulsar = read.csv("pulsar_stars.csv")
pulsar$target_class_fct = as.factor(pulsar$target_class)
str(pulsar)
## 'data.frame': 17898 obs. of 10 variables:
## $ Mean.of.the.integrated.profile : num 140.6 102.5 103 136.8 88.7 ...
## $ Standard.deviation.of.the.integrated.profile: num 55.7 58.9 39.3 57.2 40.7 ...
## $ Excess.kurtosis.of.the.integrated.profile : num -0.2346 0.4653 0.3233 -0.0684 0.6009 ...
## $ Skewness.of.the.integrated.profile : num -0.7 -0.515 1.051 -0.636 1.123 ...
## $ Mean.of.the.DM.SNR.curve : num 3.2 1.68 3.12 3.64 1.18 ...
## $ Standard.deviation.of.the.DM.SNR.curve : num 19.1 14.9 21.7 21 11.5 ...
## $ Excess.kurtosis.of.the.DM.SNR.curve : num 7.98 10.58 7.74 6.9 14.27 ...
## $ Skewness.of.the.DM.SNR.curve : num 74.2 127.4 63.2 53.6 252.6 ...
## $ target_class : int 0 0 0 0 0 0 0 0 0 0 ...
## $ target_class_fct : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
pulsar_logistic = glm(target_class_fct~Mean.of.the.integrated.profile + Standard.deviation.of.the.integrated.profile + Excess.kurtosis.of.the.integrated.profile + Skewness.of.the.integrated.profile + Mean.of.the.DM.SNR.curve + Standard.deviation.of.the.DM.SNR.curve + Excess.kurtosis.of.the.DM.SNR.curve +Skewness.of.the.DM.SNR.curve , data = pulsar, family = 'binomial')
pulsar_logistic
##
## Call: glm(formula = target_class_fct ~ Mean.of.the.integrated.profile +
## Standard.deviation.of.the.integrated.profile + Excess.kurtosis.of.the.integrated.profile +
## Skewness.of.the.integrated.profile + Mean.of.the.DM.SNR.curve +
## Standard.deviation.of.the.DM.SNR.curve + Excess.kurtosis.of.the.DM.SNR.curve +
## Skewness.of.the.DM.SNR.curve, family = "binomial", data = pulsar)
##
## Coefficients:
## (Intercept)
## -9.01995
## Mean.of.the.integrated.profile
## 0.03026
## Standard.deviation.of.the.integrated.profile
## -0.03543
## Excess.kurtosis.of.the.integrated.profile
## 6.57706
## Skewness.of.the.integrated.profile
## -0.61624
## Mean.of.the.DM.SNR.curve
## -0.02858
## Standard.deviation.of.the.DM.SNR.curve
## 0.05317
## Excess.kurtosis.of.the.DM.SNR.curve
## 0.04775
## Skewness.of.the.DM.SNR.curve
## -0.00475
##
## Degrees of Freedom: 17897 Total (i.e. Null); 17889 Residual
## Null Deviance: 10960
## Residual Deviance: 2616 AIC: 2634
In this simple example, we just took the entire dataset and ran a simple glm on it. This gave us a (rather uninformative) log-odds coefficient for each variable. We can get a slightly more informative answer by exponentiating them.
exp(pulsar_logistic$coefficients)
## (Intercept)
## 1.209717e-04
## Mean.of.the.integrated.profile
## 1.030722e+00
## Standard.deviation.of.the.integrated.profile
## 9.651900e-01
## Excess.kurtosis.of.the.integrated.profile
## 7.184261e+02
## Skewness.of.the.integrated.profile
## 5.399691e-01
## Mean.of.the.DM.SNR.curve
## 9.718201e-01
## Standard.deviation.of.the.DM.SNR.curve
## 1.054604e+00
## Excess.kurtosis.of.the.DM.SNR.curve
## 1.048907e+00
## Skewness.of.the.DM.SNR.curve
## 9.952616e-01
We can see that the odds of the star being a pulsar increase dramatically with the excess kurtosis of the integrated profile, and not much else.
But we’re using the whole dataset here. How will we know how well the model can predict on values not within the dataset? As we learned in module 8, we can use cross-validation to measure the accuracy of the model on data it hasn’t seen. As a brief reminder of how cross-validation works: the data are split into k folds, each fold is held out once as a test set while the model is trained on the remaining folds, and the performance is averaged across the folds.
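To make the idea concrete, here is a minimal sketch of 10-fold cross-validation done by hand on this dataset (the seed, fold assignment, and 0.5 classification threshold are my own choices); caret automates exactly this kind of loop:
#assign each row to one of 10 folds, fit on 9 folds, evaluate on the held-out fold
set.seed(42)
folds = sample(rep(1:10, length.out = nrow(pulsar)))
cv_acc = sapply(1:10, function(k){
  fit = glm(target_class ~ . - target_class_fct, data = pulsar[folds != k, ], family = "binomial")
  prob = predict(fit, newdata = pulsar[folds == k, ], type = "response")
  mean((prob > 0.5) == pulsar$target_class[folds == k]) #accuracy on the held-out fold
})
mean(cv_acc) #average accuracy across the 10 folds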
The R package caret provides a convenient method to do cross-validation:
train_control = trainControl(method = "cv",number = 10,savePredictions = TRUE)
pulsar_lg_cv = train(target_class_fct~Mean.of.the.integrated.profile + Standard.deviation.of.the.integrated.profile + Excess.kurtosis.of.the.integrated.profile + Skewness.of.the.integrated.profile + Mean.of.the.DM.SNR.curve + Standard.deviation.of.the.DM.SNR.curve + Excess.kurtosis.of.the.DM.SNR.curve +Skewness.of.the.DM.SNR.curve , data = pulsar, trControl = train_control, method= 'glm',family="binomial")
summary(pulsar_lg_cv)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.3813 -0.1644 -0.1000 -0.0565 3.6178
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -9.019954 0.977017 -9.232
## Mean.of.the.integrated.profile 0.030260 0.005925 5.107
## Standard.deviation.of.the.integrated.profile -0.035430 0.010386 -3.411
## Excess.kurtosis.of.the.integrated.profile 6.577063 0.300110 21.916
## Skewness.of.the.integrated.profile -0.616243 0.039709 -15.519
## Mean.of.the.DM.SNR.curve -0.028585 0.003268 -8.747
## Standard.deviation.of.the.DM.SNR.curve 0.053166 0.007364 7.220
## Excess.kurtosis.of.the.DM.SNR.curve 0.047749 0.085758 0.557
## Skewness.of.the.DM.SNR.curve -0.004750 0.003051 -1.557
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## Mean.of.the.integrated.profile 3.27e-07 ***
## Standard.deviation.of.the.integrated.profile 0.000647 ***
## Excess.kurtosis.of.the.integrated.profile < 2e-16 ***
## Skewness.of.the.integrated.profile < 2e-16 ***
## Mean.of.the.DM.SNR.curve < 2e-16 ***
## Standard.deviation.of.the.DM.SNR.curve 5.21e-13 ***
## Excess.kurtosis.of.the.DM.SNR.curve 0.577676
## Skewness.of.the.DM.SNR.curve 0.119474
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10959.5 on 17897 degrees of freedom
## Residual deviance: 2615.8 on 17889 degrees of freedom
## AIC: 2633.8
##
## Number of Fisher Scoring iterations: 8
pulsar_lg_cv$results
## parameter Accuracy Kappa AccuracySD KappaSD
## 1 none 0.9791597 0.8672609 0.002660303 0.0181324
We can see that we get the same coefficients as the last run. However, now we are also able to see the accuracy of the regression on the cross-validated samples. It looks like the model predicts decently well (about 98% accuracy).
We can also run the varImp function in the caret package in order to get the variable importance of each feature:
varImp(pulsar_lg_cv)
## glm variable importance
##
## Overall
## Excess.kurtosis.of.the.integrated.profile 100.000
## Skewness.of.the.integrated.profile 70.052
## Mean.of.the.DM.SNR.curve 38.344
## Standard.deviation.of.the.DM.SNR.curve 31.195
## Mean.of.the.integrated.profile 21.305
## Standard.deviation.of.the.integrated.profile 13.364
## Skewness.of.the.DM.SNR.curve 4.683
## Excess.kurtosis.of.the.DM.SNR.curve 0.000
It seems we were right that the excess kurtosis of the integrated profile plays the most important role, while it appears we could remove the excess kurtosis of the DM.SNR curve without much loss.
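One way to check that claim is to refit the model without that variable and compare it with the full fit, for example by AIC; a sketch (pulsar_reduced is a name introduced here for illustration):
#full model (pulsar_logistic from above) vs. a model dropping Excess.kurtosis.of.the.DM.SNR.curve
pulsar_reduced = glm(target_class_fct ~ . - target_class - Excess.kurtosis.of.the.DM.SNR.curve,
                     data = pulsar, family = "binomial")
AIC(pulsar_logistic, pulsar_reduced) #similar AIC values would support dropping the variable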
The receiver operating characteristic (ROC) curve is a measure of classifier performance. It plots the proportion of positive data points that are correctly classified as positive against the proportion of negative data points that are incorrectly classified as positive, showing the tradeoff between the rate of correct and incorrect positive predictions as the decision threshold varies. Ultimately, we will be concerned with the AUC, or area under the curve. We can also create a ROC curve using the pROC library.
auc(pulsar$target_class_fct, pulsar_lg_cv$finalModel$fitted.values)
## Area under the curve: 0.9764
plot.roc(pulsar$target_class_fct,
pulsar_lg_cv$finalModel$fitted.values)
We can see that this model allows us to choose a threshold with both a large sensitivity and a large specificity, hence the large AUC and the high accuracy.
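pROC can also suggest a specific operating point; a sketch using its coords() helper to pick the threshold that best balances sensitivity and specificity:
#build the ROC object once, then ask for the "best" threshold and its sensitivity/specificity
pulsar_roc = roc(pulsar$target_class_fct, pulsar_lg_cv$finalModel$fitted.values)
coords(pulsar_roc, "best")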
This is just a quick application of what we have gone over on another dataset (and also to satisfy my own curiosity).
pokemon = read.csv("Pokemon.csv")
str(pokemon)
## 'data.frame': 800 obs. of 13 variables:
## $ X. : int 1 2 3 3 4 5 6 6 6 7 ...
## $ Name : Factor w/ 800 levels "Abomasnow","AbomasnowMega Abomasnow",..: 81 330 746 747 103 104 100 101 102 666 ...
## $ Type.1 : Factor w/ 18 levels "Bug","Dark","Dragon",..: 10 10 10 10 7 7 7 7 7 18 ...
## $ Type.2 : Factor w/ 19 levels "","Bug","Dark",..: 15 15 15 15 1 1 9 4 9 1 ...
## $ Total : int 318 405 525 625 309 405 534 634 634 314 ...
## $ HP : int 45 60 80 80 39 58 78 78 78 44 ...
## $ Attack : int 49 62 82 100 52 64 84 130 104 48 ...
## $ Defense : int 49 63 83 123 43 58 78 111 78 65 ...
## $ Sp..Atk : int 65 80 100 122 60 80 109 130 159 50 ...
## $ Sp..Def : int 65 80 100 120 50 65 85 85 115 64 ...
## $ Speed : int 45 60 80 80 65 80 100 100 100 43 ...
## $ Generation: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Legendary : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
pokemon_lg_cv = train(Legendary~ Type.1 + Type.2 + HP+Attack+Defense+Sp..Atk+Sp..Def+Speed+Generation, data= pokemon, trControl = train_control, method = 'glm', family='binomial')
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(pokemon_lg_cv)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.87606 -0.03379 -0.00095 -0.00001 2.10325
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.119e+01 2.256e+03 -0.023 0.981899
## Type.1Dark 1.665e+01 2.256e+03 0.007 0.994111
## Type.1Dragon 1.600e+01 2.256e+03 0.007 0.994343
## Type.1Electric 1.836e+01 2.256e+03 0.008 0.993509
## Type.1Fairy 1.355e+01 2.256e+03 0.006 0.995207
## Type.1Fighting -8.192e-01 4.427e+03 0.000 0.999852
## Type.1Fire 1.754e+01 2.256e+03 0.008 0.993798
## Type.1Flying 1.915e+01 2.256e+03 0.008 0.993227
## Type.1Ghost 1.730e+01 2.256e+03 0.008 0.993881
## Type.1Grass 1.727e+01 2.256e+03 0.008 0.993893
## Type.1Ground 1.875e+01 2.256e+03 0.008 0.993371
## Type.1Ice 1.805e+01 2.256e+03 0.008 0.993617
## Type.1Normal 1.335e+01 2.256e+03 0.006 0.995278
## Type.1Poison 2.645e+00 4.745e+03 0.001 0.999555
## Type.1Psychic 1.798e+01 2.256e+03 0.008 0.993641
## Type.1Rock 1.831e+01 2.256e+03 0.008 0.993524
## Type.1Steel 1.752e+01 2.256e+03 0.008 0.993806
## Type.1Water 1.588e+01 2.256e+03 0.007 0.994386
## Type.2Bug -1.437e+01 1.415e+04 -0.001 0.999190
## Type.2Dark -5.166e+00 1.963e+00 -2.632 0.008485 **
## Type.2Dragon -3.131e+00 1.273e+00 -2.459 0.013940 *
## Type.2Electric -1.212e-02 4.165e+00 -0.003 0.997679
## Type.2Fairy -2.509e+00 1.730e+00 -1.450 0.147000
## Type.2Fighting -1.120e+00 9.696e-01 -1.155 0.247944
## Type.2Fire -1.064e+00 1.744e+00 -0.610 0.541966
## Type.2Flying -1.348e-01 8.346e-01 -0.162 0.871658
## Type.2Ghost -8.753e-01 2.244e+00 -0.390 0.696517
## Type.2Grass -1.842e+01 4.211e+03 -0.004 0.996509
## Type.2Ground -2.601e+00 1.818e+00 -1.430 0.152592
## Type.2Ice -1.615e+00 1.817e+00 -0.889 0.374196
## Type.2Normal -1.872e+01 1.265e+04 -0.001 0.998820
## Type.2Poison -1.797e+01 3.365e+03 -0.005 0.995740
## Type.2Psychic -6.486e-01 1.118e+00 -0.580 0.561844
## Type.2Rock -1.579e+01 5.994e+03 -0.003 0.997899
## Type.2Steel -4.222e-01 1.602e+00 -0.264 0.792145
## Type.2Water -2.931e-01 1.815e+00 -0.161 0.871703
## HP 6.695e-02 1.640e-02 4.082 4.46e-05 ***
## Attack 3.030e-02 1.026e-02 2.954 0.003137 **
## Defense 4.825e-02 1.255e-02 3.845 0.000121 ***
## Sp..Atk 5.193e-02 1.275e-02 4.072 4.65e-05 ***
## Sp..Def 6.222e-02 1.351e-02 4.607 4.09e-06 ***
## Speed 6.571e-02 1.452e-02 4.524 6.06e-06 ***
## Generation 6.437e-01 2.058e-01 3.127 0.001764 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 450.90 on 799 degrees of freedom
## Residual deviance: 118.25 on 757 degrees of freedom
## AIC: 204.25
##
## Number of Fisher Scoring iterations: 20
varImp(pokemon_lg_cv)
## glm variable importance
##
## only 20 most important variables shown (out of 42)
##
## Overall
## Sp..Def 100.0000
## Speed 98.2075
## HP 88.6131
## Sp..Atk 88.3995
## Defense 83.4610
## Generation 67.8835
## Attack 64.1187
## Type.2Dark 57.1328
## Type.2Dragon 53.3708
## Type.2Fairy 31.4764
## Type.2Ground 31.0472
## Type.2Fighting 25.0759
## Type.2Ice 19.2862
## Type.2Fire 13.2342
## Type.2Psychic 12.5886
## Type.2Ghost 8.4625
## Type.2Steel 5.7165
## Type.2Flying 3.5029
## Type.2Water 3.5016
## Type.1Flying 0.1803
References:
https://www.r-bloggers.com/logistic-regression-regularized-with-optimization/
https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html#cost-function
https://www.r-bloggers.com/linear-regression-by-gradient-descent/
https://rpubs.com/fhlgood/optlr
https://www.internalpointers.com/post/cost-function-logistic-regression
https://www.r-bloggers.com/evaluating-logistic-regression-models/
https://stackoverflow.com/questions/31138751/roc-curve-from-training-data-in-caret