*What is different between Linear and Logistic Regression?*
While Linear Regression is suited for estimating continuous values (e.g. estimating house price), it isn’t the best tool for predicting the class of an observed data point. In order to estimate a classification, we need some sort of guidance on what would be the most probable class for that data point. For this, we use Logistic Regression.
ogistic Regression is a variation of Linear Regression, useful when the observed dependent variable, y, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.
Despite the name logistic regression, it is actually a probabilistic classification model. Logistic regression fits a special s-shaped curve by taking the linear regression and transforming the numeric estimate into a probability with the following function:
𝑃𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦𝑂𝑓𝑎𝐶𝑙𝑎𝑠𝑠=𝜃(𝑦)=𝑒𝑦1+𝑒𝑦=𝑒𝑥𝑝(𝑦)/(1+𝑒𝑥𝑝(𝑦))=𝑝
which produces p-values between 0 (as y approaches minus infinity) and 1 (as y approaches plus infinity). This now becomes a special kind of non-linear regression.
In this equation, y is the regression result (the sum of the variables weighted by the coefficients), exp is the exponential function and 𝜃(𝑦) is the logistic function, also called logistic curve. It is a common "S" shape (sigmoid curve), and was first developed for modelling population growth.
You might also have seen this function before, in another configuration:
𝑃𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦𝑂𝑓𝑎𝐶𝑙𝑎𝑠𝑠=𝜃(𝑦)=11+𝑒−𝑥
So, briefly, Logistic Regression passes the input through the logistic/sigmoid but then treats the result as a probability:
<img src="https://ibm.box.com/shared/static/kgv9alcghmjcv97op4d6onkyxevk23b1.png", width = "400", align = "center">
library(ggplot2)
library(gridExtra)
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
library(class)
About Dataset
Loan_status: Whether a loan is paid off on in collection
Principal: Basic principal loan amount at the
Originationterms Can be weekly (7 days), biweekly, and monthly payoff schedule
Effective_date: When the loan got originated and took effects
Due_date: Since it’s one-time payoff schedule, each loan has one single due date
age:age education: education gender:
download.file("https://ibm.box.com/shared/static/sv3oy0gyhuiifmosxsvxt5ogfs71iv37.csv",
destfile = "LoanData.csv", quiet = TRUE)
options(scipen = 999) #disable scientific notation
Load Data from CSV file
LoanData <- read.csv("LoanData.csv")
head(LoanData)
## X Unnamed..0 loan_status Principal terms effective_date due_date age
## 1 0 0 PAIDOFF 1000 30 2016-09-08 2016-10-07 45
## 2 1 1 PAIDOFF 1000 30 2016-09-08 2016-10-07 50
## 3 2 2 PAIDOFF 1000 30 2016-09-08 2016-10-07 33
## 4 3 3 PAIDOFF 1000 15 2016-09-08 2016-09-22 27
## 5 4 4 PAIDOFF 1000 30 2016-09-09 2016-10-08 28
## 6 5 5 PAIDOFF 300 7 2016-09-09 2016-09-15 35
## education Gender dayofweek
## 1 High School or Below male 3
## 2 Bechalor female 3
## 3 Bechalor female 3
## 4 college male 3
## 5 college female 4
## 6 Master or Above male 4
How many rows, columns in total?
nrow(LoanData)
## [1] 400
ncol(LoanData)
## [1] 11
Data Visualization And Analysis
table(LoanData['loan_status'])
##
## COLLECTION PAIDOFF
## 100 300
300 people have paid off the loan on time and 100 have gone into collection
Lets plot a Histogram of data
For different principal:
ggplot(LoanData, aes(x=Principal, fill=loan_status)) +geom_histogram(binwidth=120,alpha=0.35,aes(y=0.5*..density..),position='identity')
For different terms:
ggplot(LoanData, aes(x=terms, fill=loan_status)) +geom_histogram(binwidth=10,alpha=0.45,aes(y=1*..density..),position='identity')+scale_x_continuous(limits = c(0, 40))
## Warning: Removed 4 rows containing missing values (geom_bar).
For different age:
ggplot(LoanData, aes(x=age, fill=loan_status)) +geom_histogram(binwidth=1,alpha=0.55,aes(y=1*..density..),position='identity')+scale_x_continuous(limits = c(0, 40))
## Warning: Removed 28 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).
Let’s examine the variables in two dimensions
hist_top <-ggplot(LoanData, aes(x=Principal, fill=loan_status)) +geom_histogram(binwidth=100,alpha=0.55,aes(y=1*..density..),position='identity')+ theme(legend.position="none")+scale_x_continuous(limits = c(200, 1100))
empty <- ggplot()+geom_point(aes(1,1), colour="white")+
theme(axis.ticks=element_blank(),
panel.background=element_blank(),
axis.text.x=element_blank(), axis.text.y=element_blank(),
axis.title.x=element_blank(), axis.title.y=element_blank())
#qplot(Principal, age, data = LoanData, colour = loan_status)
scatter <-ggplot(LoanData, aes(Principal, age),fill= loan_status) + geom_point(aes(colour = loan_status))+ theme(legend.position="top")
hist_right <-ggplot(LoanData, aes(x=age, fill=loan_status))+scale_x_continuous(limits = c(20, 45)) +geom_histogram(binwidth=1,alpha=0.55,aes(y=0.5*..density..),position='identity')+coord_flip()+ theme(legend.position="none")
grid.arrange(hist_top, empty, scatter, hist_right, ncol=2, nrow=2, widths=c(4, 1), heights=c(1, 4))
## Warning: Removed 4 rows containing missing values (geom_bar).
## Warning: Removed 12 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).
Pre-processing: Feature selection/extraction
ggplot(LoanData, aes(x=dayofweek, fill=loan_status)) +geom_histogram(binwidth=1,alpha=0.55,aes(y=1*..density..),position='identity')+scale_x_continuous(limits = c(0, 7))
## Warning: Removed 4 rows containing missing values (geom_bar).
We see that people who get the loan at the end of the week dont pay it off, so lets use Feature binarization to set a threshold values less then day 4
Converting days of the week to categorical value
Create empty vector for indicator variable, Then, Set all values over the 3 day equal to one, else keep them at zero.
namevector <- c("Weekend")
LoanData[,namevector] <- 0
LoanData$Weekend[LoanData$dayofweek>3]<-1
head(LoanData)
## X Unnamed..0 loan_status Principal terms effective_date due_date age
## 1 0 0 PAIDOFF 1000 30 2016-09-08 2016-10-07 45
## 2 1 1 PAIDOFF 1000 30 2016-09-08 2016-10-07 50
## 3 2 2 PAIDOFF 1000 30 2016-09-08 2016-10-07 33
## 4 3 3 PAIDOFF 1000 15 2016-09-08 2016-09-22 27
## 5 4 4 PAIDOFF 1000 30 2016-09-09 2016-10-08 28
## 6 5 5 PAIDOFF 300 7 2016-09-09 2016-09-15 35
## education Gender dayofweek Weekend
## 1 High School or Below male 3 0
## 2 Bechalor female 3 0
## 3 Bechalor female 3 0
## 4 college male 3 0
## 5 college female 4 1
## 6 Master or Above male 4 1
Encoding one categorical feature
namevector <- c("Gender01")
LoanData[,namevector] <- 0
LoanData$Gender01[LoanData$Gender=='male']=1
head(LoanData[,c('Gender','Gender01')])
## Gender Gender01
## 1 male 1
## 2 female 0
## 3 female 0
## 4 male 1
## 5 female 0
## 6 male 1
table(LoanData$Gender01, LoanData$loan_status)
##
## COLLECTION PAIDOFF
## 0 10 53
## 1 90 247
ggplot(LoanData, aes(x=Gender01, fill=loan_status)) +geom_histogram(binwidth=1,alpha=0.55,aes(y=1*..density..),position='identity')+scale_x_continuous(limits = c(0, 2))
## Warning: Removed 4 rows containing missing values (geom_bar).
As we can see, 84 % of female pay there loans while ony 73 % of males pay there loan
Convert Multiple Categorical Features Using One Hot Encoding
Some variables, such as Education, have multiple value, and we have to use __ one-hot encoding__ technique to discretize them, therefore, we use dummies library.
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
Feature before one hot encoding:
head(LoanData['education'])
## education
## 1 High School or Below
## 2 Bechalor
## 3 Bechalor
## 4 college
## 5 college
## 6 Master or Above
Use one hot encoding technique to convert categorical variables to binary variables and append them to the feature Data Frame
LoanData=dummy.data.frame(LoanData, names=c("education"))
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
head(LoanData[c('educationBechalor', 'educationcollege', 'educationHigh School or Below','educationMaster or Above')])
## educationBechalor educationcollege educationHigh School or Below
## 1 0 0 1
## 2 1 0 0
## 3 1 0 0
## 4 0 1 0
## 5 0 1 0
## 6 0 0 0
## educationMaster or Above
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
Assembling the features
Colunms <- c('Principal','terms','age','educationBechalor', 'educationcollege', 'educationHigh School or Below','educationMaster or Above','Weekend','Gender01')
Data <- LoanData[Colunms]
head(Data)
## Principal terms age educationBechalor educationcollege
## 1 1000 30 45 0 0
## 2 1000 30 50 1 0
## 3 1000 30 33 1 0
## 4 1000 15 27 0 1
## 5 1000 30 28 0 1
## 6 300 7 35 0 0
## educationHigh School or Below educationMaster or Above Weekend Gender01
## 1 1 0 0 1
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 1
## 5 0 0 1 0
## 6 0 1 1 1
Let’s put all the labels in the data frame y
NewColumn <- c("Class")
Data[,NewColumn] <- 0
Data$Class[LoanData$loan_status=='PAIDOFF']=1
head(Data[,NewColumn],10)
## [1] 1 1 1 1 1 1 1 1 1 1
head(LoanData$loan_status,10)
## [1] "PAIDOFF" "PAIDOFF" "PAIDOFF" "PAIDOFF" "PAIDOFF" "PAIDOFF" "PAIDOFF"
## [8] "PAIDOFF" "PAIDOFF" "PAIDOFF"
Normalize Data
Data Standardization give data zero mean and unit variance (technically should be done after train test split )
Data[Colunms] <- scale(Data[Colunms])
head(Data[Colunms])
## Principal terms age educationBechalor educationcollege
## 1 0.5006747 0.9197424 2.3104561 -0.3860722 -0.8674676
## 2 0.5006747 0.9197424 3.1391707 2.5837138 -0.8674676
## 3 0.5006747 0.9197424 0.3215412 2.5837138 -0.8674676
## 4 0.5006747 -0.9320879 -0.6729162 -0.3860722 1.1498989
## 5 0.5006747 0.9197424 -0.5071733 -0.3860722 1.1498989
## 6 -5.3162931 -1.9197307 0.6530271 -0.3860722 -0.8674676
## educationHigh School or Below educationMaster or Above Weekend Gender01
## 1 1.1498989 -0.1003781 -1.2168652 0.431829
## 2 -0.8674676 -0.1003781 -1.2168652 -2.309942
## 3 -0.8674676 -0.1003781 -1.2168652 -2.309942
## 4 -0.8674676 -0.1003781 -1.2168652 0.431829
## 5 -0.8674676 -0.1003781 0.8197293 -2.309942
## 6 -0.8674676 9.9374292 0.8197293 0.431829
Train Test Split
set.seed(3)
testindex <- sample.int(nrow(Data))[1:floor(0.1*nrow(Data))]
TestData <- Data[testindex,];
head(TestData)
## Principal terms age educationBechalor educationcollege
## 261 -1.1613161 -0.9320879 3.304913593 -0.3860722 1.1498989
## 186 0.5006747 0.9197424 -0.009944575 -0.3860722 -0.8674676
## 140 0.5006747 0.9197424 -1.833116567 -0.3860722 -0.8674676
## 36 -1.1613161 -0.9320879 -0.341430391 2.5837138 -0.8674676
## 399 0.5006747 0.9197424 -0.507173300 -0.3860722 1.1498989
## 363 0.5006747 0.9197424 -0.838659116 -0.3860722 -0.8674676
## educationHigh School or Below educationMaster or Above Weekend Gender01
## 261 -0.8674676 -0.1003781 -1.2168652 0.431829
## 186 1.1498989 -0.1003781 -1.2168652 -2.309942
## 140 1.1498989 -0.1003781 0.8197293 0.431829
## 36 -0.8674676 -0.1003781 0.8197293 0.431829
## 399 -0.8674676 -0.1003781 -1.2168652 0.431829
## 363 1.1498989 -0.1003781 0.8197293 0.431829
## Class
## 261 1
## 186 1
## 140 1
## 36 1
## 399 0
## 363 0
TrainData=Data[-testindex,]
head(TrainData)
## Principal terms age educationBechalor educationcollege
## 1 0.5006747 0.9197424 2.3104561 -0.3860722 -0.8674676
## 2 0.5006747 0.9197424 3.1391707 2.5837138 -0.8674676
## 3 0.5006747 0.9197424 0.3215412 2.5837138 -0.8674676
## 4 0.5006747 -0.9320879 -0.6729162 -0.3860722 1.1498989
## 5 0.5006747 0.9197424 -0.5071733 -0.3860722 1.1498989
## 6 -5.3162931 -1.9197307 0.6530271 -0.3860722 -0.8674676
## educationHigh School or Below educationMaster or Above Weekend Gender01
## 1 1.1498989 -0.1003781 -1.2168652 0.431829
## 2 -0.8674676 -0.1003781 -1.2168652 -2.309942
## 3 -0.8674676 -0.1003781 -1.2168652 -2.309942
## 4 -0.8674676 -0.1003781 -1.2168652 0.431829
## 5 -0.8674676 -0.1003781 0.8197293 -2.309942
## 6 -0.8674676 9.9374292 0.8197293 0.431829
## Class
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
Logistic Regression
model <- glm(Class~.,family=binomial(link='logit'),data=TrainData, control = list(maxit = 50))
summary(model)
##
## Call:
## glm(formula = Class ~ ., family = binomial(link = "logit"), data = TrainData,
## control = list(maxit = 50))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1583 -0.2382 0.1628 0.9052 1.4077
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.11019 7.28831 0.290 0.772
## Principal -0.09595 0.20468 -0.469 0.639
## terms -0.10262 0.17587 -0.584 0.560
## age 0.16290 0.15169 1.074 0.283
## educationBechalor -4.65162 245.20682 -0.019 0.985
## educationcollege -6.81761 360.97144 -0.019 0.985
## `educationHigh School or Below` -7.13572 360.97144 -0.020 0.984
## `educationMaster or Above` NA NA NA NA
## Weekend -1.97879 0.35821 -5.524 0.0000000331 ***
## Gender01 -0.24388 0.15557 -1.568 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 404.88 on 359 degrees of freedom
## Residual deviance: 295.78 on 351 degrees of freedom
## AIC: 313.78
##
## Number of Fisher Scoring iterations: 14
Prediction
fitted.results <- predict(model,newdata=TestData,type='response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
yhat <- ifelse(fitted.results > 0.5,1,0)
yhat[1:5]
## 261 186 140 36 399
## 1 1 0 1 1
Let’s get the actual labels
y <- TestData[,c('Class')]
y[1:4]
## [1] 1 1 1 1
Model Evaluation
Let’s calculate the accuracy:
mean(yhat==y)
## [1] 0.575
Confusion Matrix
ConfusionMatrix<- table(paste(as.character(yhat)," pred", sep =""), paste(as.character(y)," true", sep =""))
ConfusionMatrix
##
## 0 true 1 true
## 0 pred 1 8
## 1 pred 9 22
END