In this tutorial, we are going to learn how to perform basic logistic regression. It is similar to linear regression, except the response variable is categorical (here, a binary 0/1 outcome).
First, we will import our dataset from Kaggle. It is a candy dataset, and we are going to see if we can predict whether a candy is chocolate from its other variables. The dataset can be downloaded here: https://www.kaggle.com/fivethirtyeight/the-ultimate-halloween-candy-power-ranking/
inputData <- read.csv("file:///C:/Users/Dakota/Downloads/candy-data.csv")
head(inputData)
## competitorname chocolate fruity caramel peanutyalmondy nougat
## 1 100 Grand 1 0 1 0 0
## 2 3 Musketeers 1 0 0 0 1
## 3 One dime 0 0 0 0 0
## 4 One quarter 0 0 0 0 0
## 5 Air Heads 0 1 0 0 0
## 6 Almond Joy 1 0 0 1 0
## crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
## 1 1 0 1 0 0.732 0.860 66.97173
## 2 0 0 1 0 0.604 0.511 67.60294
## 3 0 0 0 0 0.011 0.116 32.26109
## 4 0 0 0 0 0.011 0.511 46.11650
## 5 0 0 0 0 0.906 0.511 52.34146
## 6 0 0 1 0 0.465 0.767 50.34755
Some of the variables, such as caramel and fruity, are categorical: a 0 indicates a non-event and a 1 indicates an event.
Now we will check for class bias: ideally, the proportions of events and non-events in chocolate (the Y variable) should be roughly the same.
table(inputData$chocolate)
##
## 0 1
## 48 37
The table shows fairly even counts, with events (1's) slightly under-represented, so there is some mild potential class bias.
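If you prefer proportions to raw counts, prop.table() reports the same breakdown as shares of the whole (a small optional check, not part of the original output):
prop.table(table(inputData$chocolate)) # share of non-chocolate (0) vs chocolate (1) candies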
Next we create training and test samples. To remove the class bias from training, we draw equal numbers of 1's and 0's for the training set.
#Create Training Data
input_ones <- inputData[which(inputData$chocolate == 1), ] # all 1's
input_zeros <- inputData[which(inputData$chocolate == 0), ] # all 0's
set.seed(100) # for repeatability of samples
input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones)) # 1's for training
input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_ones)) # 0's for training. Pick as many 0's as 1's
training_ones <- input_ones[input_ones_training_rows, ]
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingData <- rbind(training_ones, training_zeros) # row bind the 1's and 0's
#Create Test Data
test_ones <- input_ones[-input_ones_training_rows, ]
test_zeros <- input_zeros[-input_zeros_training_rows, ]
testData <- rbind(test_ones, test_zeros) # row bind the 1's and 0's
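As a quick sanity check (not shown in the original output), you can confirm that the training data is now balanced and see how many rows are left over for testing:
table(trainingData$chocolate) # should show equal counts of 0's and 1's
nrow(testData) # the remaining rows form the test set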
We now need to install and load the smbinning package, which bins our continuous variables into categories and computes an information value (IV) for each predictor, so we can compare the predictive strength of all the variables on a common scale.
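If you do not already have the package, it can be installed from CRAN before loading it:
# install.packages("smbinning") # run once if the package is not installed yet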
library(smbinning)
# segregate continuous and factor variables
factor_vars <- c("competitorname", "fruity", "caramel", "peanutyalmondy", "nougat", "crispedricewafer", "hard", "bar", "pluribus")
continuous_vars <- c("sugarpercent", "pricepercent","winpercent")
iv_df <- data.frame(VARS=c(factor_vars, continuous_vars), IV=numeric(12)) # init for IV results
# compute IV for categorical vars
for(factor_var in factor_vars){
  smb <- smbinning.factor(trainingData, y="chocolate", x=factor_var) # WOE table
  if(class(smb) != "character"){ # check if an error occurred
    iv_df[iv_df$VARS == factor_var, "IV"] <- smb$iv
  }
}
# compute IV for continuous vars
for(continuous_var in continuous_vars){
  smb <- smbinning(trainingData, y="chocolate", x=continuous_var) # WOE table
  if(class(smb) != "character"){ # check if an error occurred while calculating scores
    iv_df[iv_df$VARS == continuous_var, "IV"] <- smb$iv
  }
}
iv_df <- iv_df[order(-iv_df$IV), ] # sort
iv_df
## VARS IV
## 12 winpercent 1.9178
## 11 pricepercent 1.4218
## 1 competitorname 0.0000
## 2 fruity 0.0000
## 3 caramel 0.0000
## 4 peanutyalmondy 0.0000
## 5 nougat 0.0000
## 6 crispedricewafer 0.0000
## 7 hard 0.0000
## 8 bar 0.0000
## 9 pluribus 0.0000
## 10 sugarpercent 0.0000
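One caveat worth flagging: every one of the 0/1 flag variables comes out with an IV of exactly zero above. smbinning.factor() expects its x variable to be a factor, so when it is handed a plain numeric 0/1 column (or a high-cardinality identifier like competitorname) it returns an error message instead of a binning result, and the class check in the loop leaves the IV at its initial value of 0. If you also want IVs for those flags, one option, sketched below under that assumption, is to convert them to factors and rerun the loop:
# Convert the 0/1 flag columns to factors so smbinning.factor() can bin them
flag_vars <- setdiff(factor_vars, "competitorname") # competitorname is a unique ID, so leave it out
trainingData[flag_vars] <- lapply(trainingData[flag_vars], factor)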
From here we use these results to build our logit model. I chose to use only the variables with an IV above zero: winpercent and pricepercent.
logitMod <- glm(chocolate ~ winpercent + pricepercent, data=trainingData, family=binomial(link="logit"))
predicted <- plogis(predict(logitMod, testData)) # predicted scores
# or
predicted <- predict(logitMod, testData, type="response") # predicted scores
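The two lines above are equivalent: plogis() is the inverse-logit function 1/(1 + exp(-x)), which is exactly the transformation type="response" applies to the linear predictor. You can verify this yourself:
all.equal(plogis(predict(logitMod, testData)),
          predict(logitMod, testData, type="response")) # should return TRUE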
We are now going to find the optimal probability cutoff for classifying the model's predictions. We might expect the best cutoff to be 0.5 if the 0's and 1's were in even proportion, but we know that is not quite the case, so we let optimalCutoff() search for the threshold that performs best on the test data.
library(InformationValue)
optCutOff <- optimalCutoff(testData$chocolate, predicted)[1]
optCutOff
## [1] 0.6256588
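This cutoff is the probability above which we classify a candy as chocolate. As a small illustration (not part of the original output), this is how it turns the predicted probabilities into class labels:
predicted_class <- ifelse(predicted > optCutOff, 1, 0) # classify as chocolate when the score exceeds the cutoff
table(predicted_class) # counts of predicted 0's and 1's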
We will now look at the model summary to examine the coefficients and their significance.
summary(logitMod)
##
## Call:
## glm(formula = chocolate ~ winpercent + pricepercent, family = binomial(link = "logit"),
## data = trainingData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.58294 -0.74381 -0.03106 0.55504 2.58571
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.63949 2.23677 -3.415 0.000637 ***
## winpercent 0.11777 0.03987 2.954 0.003142 **
## pricepercent 3.00439 1.37119 2.191 0.028446 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 69.315 on 49 degrees of freedom
## Residual deviance: 43.652 on 47 degrees of freedom
## AIC: 49.652
##
## Number of Fisher Scoring iterations: 5
We can see that both of our predictors are significant.
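Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret (an optional step not shown in the original output):
exp(coef(logitMod)) # odds ratios for the intercept and each predictor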
library(car)
vif(logitMod)
## winpercent pricepercent
## 1.00129 1.00129
Looking at our VIF values, we can see that there is no multicollinearity: both are very close to 1, well below the common cutoff of 4.
Now let's look at the misclassification error, which is the percentage of predicted 1's and 0's that do not match the actual values. The lower the better, and we can see that ours is quite low.
misClassError(testData$chocolate, predicted, threshold = optCutOff)
## [1] 0.1143
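To see where that number comes from, the same figure can be computed by hand as the share of test candies whose predicted class disagrees with the actual label (a small sketch; the result may differ slightly if a score falls exactly on the cutoff):
mean(ifelse(predicted > optCutOff, 1, 0) != testData$chocolate) # proportion of mismatched predictions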
Next is the Receiver Operating Characteristic (ROC) curve, which traces the true positive rate against the false positive rate as the classification cutoff is varied. The greater the area under the curve, the better the model separates the classes; as the plot shows, this model has a high area under the curve.
plotROC(testData$chocolate,predicted)
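If you would rather see the area under the curve as a single number instead of reading it off the plot, the InformationValue package also provides an AUROC() function (assuming it is available in your version of the package):
AUROC(testData$chocolate, predicted) # area under the ROC curve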
Now moving on to concordance: out of every pair consisting of one actual 1 and one actual 0, concordance is the proportion of pairs where the predicted score of the 1 is higher than that of the 0. A perfect model would be 100% concordant; ours is about 95% of the 276 possible pairs (12 actual 1's times 23 actual 0's in the test set), which is quite high!
Concordance(testData$chocolate,predicted)
## $Concordance
## [1] 0.9456522
##
## $Discordance
## [1] 0.05434783
##
## $Tied
## [1] 2.081668e-17
##
## $Pairs
## [1] 276
Sensitivity is the number of actual 1's that were also predicted as 1's, divided by the total number of actual 1's.
sensitivity(testData$chocolate,predicted, threshold = optCutOff)
## [1] 0.75
Specificity is the number of actual 0's that were also predicted as 0's, divided by the total number of actual 0's.
specificity(testData$chocolate,predicted, threshold = optCutOff)
## [1] 0.9565217
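Both numbers can be reproduced by hand from the predicted classes, which makes the definitions concrete (a sketch using the same cutoff; scores exactly at the cutoff may be handled differently by the package):
pred_class <- ifelse(predicted > optCutOff, 1, 0)
sum(pred_class == 1 & testData$chocolate == 1) / sum(testData$chocolate == 1) # sensitivity
sum(pred_class == 0 & testData$chocolate == 0) / sum(testData$chocolate == 0) # specificity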
Finally, we have the confusion matrix. The columns are the actual classes and the rows are the predicted classes: of the 12 actual chocolates, 9 were predicted correctly, and of the 23 non-chocolates, 22 were predicted correctly.
confusionMatrix(testData$chocolate,predicted, threshold = optCutOff)
## 0 1
## 0 22 3
## 1 1 9
To conclude, after running all these checks on our logistic model, we can see that it did a pretty good job. This was just one example of how to run a regression model when the outcome variable is categorical!