Logistic Regression in R

*What is different between Linear and Logistic Regression?*

While Linear Regression is suited for estimating continuous values (e.g. estimating house price), it isn’t the best tool for predicting the class of an observed data point. In order to estimate a classification, we need some sort of guidance on what would be the most probable class for that data point. For this, we use Logistic Regression.

ogistic Regression is a variation of Linear Regression, useful when the observed dependent variable, y, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.

Despite the name logistic regression, it is actually a probabilistic classification model. Logistic regression fits a special s-shaped curve by taking the linear regression and transforming the numeric estimate into a probability with the following function:

𝑃𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦𝑂𝑓𝑎𝐶𝑙𝑎𝑠𝑠=𝜃(𝑦)=𝑒𝑦1+𝑒𝑦=𝑒𝑥𝑝(𝑦)/(1+𝑒𝑥𝑝(𝑦))=𝑝
 
which produces p-values between 0 (as y approaches minus infinity) and 1 (as y approaches plus infinity). This now becomes a special kind of non-linear regression.

In this equation, y is the regression result (the sum of the variables weighted by the coefficients), exp is the exponential function and  𝜃(𝑦)  is the logistic function, also called logistic curve. It is a common "S" shape (sigmoid curve), and was first developed for modelling population growth.

You might also have seen this function before, in another configuration:

𝑃𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦𝑂𝑓𝑎𝐶𝑙𝑎𝑠𝑠=𝜃(𝑦)=11+𝑒−𝑥
 
So, briefly, Logistic Regression passes the input through the logistic/sigmoid but then treats the result as a probability:

<img src="https://ibm.box.com/shared/static/kgv9alcghmjcv97op4d6onkyxevk23b1.png", width = "400", align = "center">

library(ggplot2)
library(gridExtra)
library(VGAM)

## Loading required package: stats4

## Loading required package: splines

library(class)

About Dataset

Loan_status: Whether a loan is paid off on in collection

Principal: Basic principal loan amount at the

Originationterms Can be weekly (7 days), biweekly, and monthly payoff schedule

Effective_date: When the loan got originated and took effects

Due_date: Since it’s one-time payoff schedule, each loan has one single due date

age:age education: education gender:

download.file("https://ibm.box.com/shared/static/sv3oy0gyhuiifmosxsvxt5ogfs71iv37.csv",
              destfile = "LoanData.csv", quiet = TRUE)
options(scipen = 999) #disable scientific notation

Load Data from CSV file

LoanData <- read.csv("LoanData.csv")
head(LoanData)

##   X Unnamed..0 loan_status Principal terms effective_date   due_date age
## 1 0          0     PAIDOFF      1000    30     2016-09-08 2016-10-07  45
## 2 1          1     PAIDOFF      1000    30     2016-09-08 2016-10-07  50
## 3 2          2     PAIDOFF      1000    30     2016-09-08 2016-10-07  33
## 4 3          3     PAIDOFF      1000    15     2016-09-08 2016-09-22  27
## 5 4          4     PAIDOFF      1000    30     2016-09-09 2016-10-08  28
## 6 5          5     PAIDOFF       300     7     2016-09-09 2016-09-15  35
##              education Gender dayofweek
## 1 High School or Below   male         3
## 2             Bechalor female         3
## 3             Bechalor female         3
## 4              college   male         3
## 5              college female         4
## 6      Master or Above   male         4

How many rows, columns in total?

nrow(LoanData)

## [1] 400

ncol(LoanData)

## [1] 11

Data Visualization And Analysis

table(LoanData['loan_status'])

## 
## COLLECTION    PAIDOFF 
##        100        300

300 people have paid off the loan on time and 100 have gone into collection

Lets plot a Histogram of data

For different principal:

ggplot(LoanData, aes(x=Principal, fill=loan_status)) +geom_histogram(binwidth=120,alpha=0.35,aes(y=0.5*..density..),position='identity')

For different terms:

ggplot(LoanData, aes(x=terms, fill=loan_status)) +geom_histogram(binwidth=10,alpha=0.45,aes(y=1*..density..),position='identity')+scale_x_continuous(limits = c(0, 40))

## Warning: Removed 4 rows containing missing values (geom_bar).

For different age:

ggplot(LoanData, aes(x=age, fill=loan_status)) +geom_histogram(binwidth=1,alpha=0.55,aes(y=1*..density..),position='identity')+scale_x_continuous(limits = c(0, 40))

## Warning: Removed 28 rows containing non-finite values (stat_bin).

## Warning: Removed 4 rows containing missing values (geom_bar).

Let’s examine the variables in two dimensions

hist_top <-ggplot(LoanData, aes(x=Principal, fill=loan_status)) +geom_histogram(binwidth=100,alpha=0.55,aes(y=1*..density..),position='identity')+ theme(legend.position="none")+scale_x_continuous(limits = c(200, 1100))

empty <- ggplot()+geom_point(aes(1,1), colour="white")+
         theme(axis.ticks=element_blank(), 
               panel.background=element_blank(), 
               axis.text.x=element_blank(), axis.text.y=element_blank(),           
               axis.title.x=element_blank(), axis.title.y=element_blank())

#qplot(Principal, age, data = LoanData, colour = loan_status)
scatter <-ggplot(LoanData, aes(Principal, age),fill= loan_status)  + geom_point(aes(colour = loan_status))+ theme(legend.position="top")

hist_right <-ggplot(LoanData, aes(x=age, fill=loan_status))+scale_x_continuous(limits = c(20, 45)) +geom_histogram(binwidth=1,alpha=0.55,aes(y=0.5*..density..),position='identity')+coord_flip()+ theme(legend.position="none")


grid.arrange(hist_top, empty, scatter, hist_right, ncol=2, nrow=2, widths=c(4, 1), heights=c(1, 4))

## Warning: Removed 4 rows containing missing values (geom_bar).

## Warning: Removed 12 rows containing non-finite values (stat_bin).

## Warning: Removed 4 rows containing missing values (geom_bar).

Pre-processing: Feature selection/extraction

ggplot(LoanData, aes(x=dayofweek, fill=loan_status))  +geom_histogram(binwidth=1,alpha=0.55,aes(y=1*..density..),position='identity')+scale_x_continuous(limits = c(0, 7))

## Warning: Removed 4 rows containing missing values (geom_bar).

We see that people who get the loan at the end of the week dont pay it off, so lets use Feature binarization to set a threshold values less then day 4

Converting days of the week to categorical value

Create empty vector for indicator variable, Then, Set all values over the 3 day equal to one, else keep them at zero.

namevector <- c("Weekend")
LoanData[,namevector] <- 0
LoanData$Weekend[LoanData$dayofweek>3]<-1
head(LoanData)

##   X Unnamed..0 loan_status Principal terms effective_date   due_date age
## 1 0          0     PAIDOFF      1000    30     2016-09-08 2016-10-07  45
## 2 1          1     PAIDOFF      1000    30     2016-09-08 2016-10-07  50
## 3 2          2     PAIDOFF      1000    30     2016-09-08 2016-10-07  33
## 4 3          3     PAIDOFF      1000    15     2016-09-08 2016-09-22  27
## 5 4          4     PAIDOFF      1000    30     2016-09-09 2016-10-08  28
## 6 5          5     PAIDOFF       300     7     2016-09-09 2016-09-15  35
##              education Gender dayofweek Weekend
## 1 High School or Below   male         3       0
## 2             Bechalor female         3       0
## 3             Bechalor female         3       0
## 4              college   male         3       0
## 5              college female         4       1
## 6      Master or Above   male         4       1

Encoding one categorical feature

namevector <- c("Gender01")
LoanData[,namevector] <- 0
LoanData$Gender01[LoanData$Gender=='male']=1
head(LoanData[,c('Gender','Gender01')])

##   Gender Gender01
## 1   male        1
## 2 female        0
## 3 female        0
## 4   male        1
## 5 female        0
## 6   male        1

table(LoanData$Gender01, LoanData$loan_status)

##    
##     COLLECTION PAIDOFF
##   0         10      53
##   1         90     247

ggplot(LoanData, aes(x=Gender01, fill=loan_status))  +geom_histogram(binwidth=1,alpha=0.55,aes(y=1*..density..),position='identity')+scale_x_continuous(limits = c(0, 2))

## Warning: Removed 4 rows containing missing values (geom_bar).

As we can see, 84 % of female pay there loans while ony 73 % of males pay there loan

Convert Multiple Categorical Features Using One Hot Encoding

Some variables, such as Education, have multiple value, and we have to use __ one-hot encoding__ technique to discretize them, therefore, we use dummies library.

library(dummies)

## dummies-1.5.6 provided by Decision Patterns

Feature before one hot encoding:

head(LoanData['education'])

##              education
## 1 High School or Below
## 2             Bechalor
## 3             Bechalor
## 4              college
## 5              college
## 6      Master or Above

Use one hot encoding technique to convert categorical variables to binary variables and append them to the feature Data Frame

LoanData=dummy.data.frame(LoanData, names=c("education"))

## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

head(LoanData[c('educationBechalor', 'educationcollege',  'educationHigh School or Below','educationMaster or Above')])

##   educationBechalor educationcollege educationHigh School or Below
## 1                 0                0                             1
## 2                 1                0                             0
## 3                 1                0                             0
## 4                 0                1                             0
## 5                 0                1                             0
## 6                 0                0                             0
##   educationMaster or Above
## 1                        0
## 2                        0
## 3                        0
## 4                        0
## 5                        0
## 6                        1

Assembling the features

Colunms <- c('Principal','terms','age','educationBechalor', 'educationcollege',  'educationHigh School or Below','educationMaster or Above','Weekend','Gender01')
Data <- LoanData[Colunms]
head(Data)

##   Principal terms age educationBechalor educationcollege
## 1      1000    30  45                 0                0
## 2      1000    30  50                 1                0
## 3      1000    30  33                 1                0
## 4      1000    15  27                 0                1
## 5      1000    30  28                 0                1
## 6       300     7  35                 0                0
##   educationHigh School or Below educationMaster or Above Weekend Gender01
## 1                             1                        0       0        1
## 2                             0                        0       0        0
## 3                             0                        0       0        0
## 4                             0                        0       0        1
## 5                             0                        0       1        0
## 6                             0                        1       1        1

Let’s put all the labels in the data frame y

NewColumn <- c("Class")
Data[,NewColumn] <- 0
Data$Class[LoanData$loan_status=='PAIDOFF']=1
head(Data[,NewColumn],10)

##  [1] 1 1 1 1 1 1 1 1 1 1

head(LoanData$loan_status,10)

##  [1] "PAIDOFF" "PAIDOFF" "PAIDOFF" "PAIDOFF" "PAIDOFF" "PAIDOFF" "PAIDOFF"
##  [8] "PAIDOFF" "PAIDOFF" "PAIDOFF"

Normalize Data

Data Standardization give data zero mean and unit variance (technically should be done after train test split )

Data[Colunms] <- scale(Data[Colunms])
head(Data[Colunms])

##    Principal      terms        age educationBechalor educationcollege
## 1  0.5006747  0.9197424  2.3104561        -0.3860722       -0.8674676
## 2  0.5006747  0.9197424  3.1391707         2.5837138       -0.8674676
## 3  0.5006747  0.9197424  0.3215412         2.5837138       -0.8674676
## 4  0.5006747 -0.9320879 -0.6729162        -0.3860722        1.1498989
## 5  0.5006747  0.9197424 -0.5071733        -0.3860722        1.1498989
## 6 -5.3162931 -1.9197307  0.6530271        -0.3860722       -0.8674676
##   educationHigh School or Below educationMaster or Above    Weekend  Gender01
## 1                     1.1498989               -0.1003781 -1.2168652  0.431829
## 2                    -0.8674676               -0.1003781 -1.2168652 -2.309942
## 3                    -0.8674676               -0.1003781 -1.2168652 -2.309942
## 4                    -0.8674676               -0.1003781 -1.2168652  0.431829
## 5                    -0.8674676               -0.1003781  0.8197293 -2.309942
## 6                    -0.8674676                9.9374292  0.8197293  0.431829

Train Test Split

set.seed(3)

testindex <- sample.int(nrow(Data))[1:floor(0.1*nrow(Data))]
TestData <- Data[testindex,];
head(TestData)

##      Principal      terms          age educationBechalor educationcollege
## 261 -1.1613161 -0.9320879  3.304913593        -0.3860722        1.1498989
## 186  0.5006747  0.9197424 -0.009944575        -0.3860722       -0.8674676
## 140  0.5006747  0.9197424 -1.833116567        -0.3860722       -0.8674676
## 36  -1.1613161 -0.9320879 -0.341430391         2.5837138       -0.8674676
## 399  0.5006747  0.9197424 -0.507173300        -0.3860722        1.1498989
## 363  0.5006747  0.9197424 -0.838659116        -0.3860722       -0.8674676
##     educationHigh School or Below educationMaster or Above    Weekend  Gender01
## 261                    -0.8674676               -0.1003781 -1.2168652  0.431829
## 186                     1.1498989               -0.1003781 -1.2168652 -2.309942
## 140                     1.1498989               -0.1003781  0.8197293  0.431829
## 36                     -0.8674676               -0.1003781  0.8197293  0.431829
## 399                    -0.8674676               -0.1003781 -1.2168652  0.431829
## 363                     1.1498989               -0.1003781  0.8197293  0.431829
##     Class
## 261     1
## 186     1
## 140     1
## 36      1
## 399     0
## 363     0

TrainData=Data[-testindex,]
head(TrainData)

##    Principal      terms        age educationBechalor educationcollege
## 1  0.5006747  0.9197424  2.3104561        -0.3860722       -0.8674676
## 2  0.5006747  0.9197424  3.1391707         2.5837138       -0.8674676
## 3  0.5006747  0.9197424  0.3215412         2.5837138       -0.8674676
## 4  0.5006747 -0.9320879 -0.6729162        -0.3860722        1.1498989
## 5  0.5006747  0.9197424 -0.5071733        -0.3860722        1.1498989
## 6 -5.3162931 -1.9197307  0.6530271        -0.3860722       -0.8674676
##   educationHigh School or Below educationMaster or Above    Weekend  Gender01
## 1                     1.1498989               -0.1003781 -1.2168652  0.431829
## 2                    -0.8674676               -0.1003781 -1.2168652 -2.309942
## 3                    -0.8674676               -0.1003781 -1.2168652 -2.309942
## 4                    -0.8674676               -0.1003781 -1.2168652  0.431829
## 5                    -0.8674676               -0.1003781  0.8197293 -2.309942
## 6                    -0.8674676                9.9374292  0.8197293  0.431829
##   Class
## 1     1
## 2     1
## 3     1
## 4     1
## 5     1
## 6     1

Logistic Regression

model <- glm(Class~.,family=binomial(link='logit'),data=TrainData, control = list(maxit = 50))
summary(model)

## 
## Call:
## glm(formula = Class ~ ., family = binomial(link = "logit"), data = TrainData, 
##     control = list(maxit = 50))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1583  -0.2382   0.1628   0.9052   1.4077  
## 
## Coefficients: (1 not defined because of singularities)
##                                  Estimate Std. Error z value     Pr(>|z|)    
## (Intercept)                       2.11019    7.28831   0.290        0.772    
## Principal                        -0.09595    0.20468  -0.469        0.639    
## terms                            -0.10262    0.17587  -0.584        0.560    
## age                               0.16290    0.15169   1.074        0.283    
## educationBechalor                -4.65162  245.20682  -0.019        0.985    
## educationcollege                 -6.81761  360.97144  -0.019        0.985    
## `educationHigh School or Below`  -7.13572  360.97144  -0.020        0.984    
## `educationMaster or Above`             NA         NA      NA           NA    
## Weekend                          -1.97879    0.35821  -5.524 0.0000000331 ***
## Gender01                         -0.24388    0.15557  -1.568        0.117    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 404.88  on 359  degrees of freedom
## Residual deviance: 295.78  on 351  degrees of freedom
## AIC: 313.78
## 
## Number of Fisher Scoring iterations: 14

Prediction

fitted.results <- predict(model,newdata=TestData,type='response')

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

yhat <- ifelse(fitted.results > 0.5,1,0)
yhat[1:5]

## 261 186 140  36 399 
##   1   1   0   1   1

Let’s get the actual labels

y <- TestData[,c('Class')]
y[1:4]

## [1] 1 1 1 1

Model Evaluation

Let’s calculate the accuracy:

mean(yhat==y)

## [1] 0.575

Confusion Matrix

ConfusionMatrix<- table(paste(as.character(yhat)," pred", sep =""), paste(as.character(y)," true", sep =""))
ConfusionMatrix

##         
##          0 true 1 true
##   0 pred      1      8
##   1 pred      9     22

END

Logistic Regression in R

Anoop

14/07/2021