In this tutorial, we are going to learn how to perform basic logistic regression. It is similar to linear regression, except the response variable is categorical.
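
Concretely, instead of modeling the response directly as a linear combination of the predictors, logistic regression models the log-odds of the event, log(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk, which keeps the predicted probability p between 0 and 1.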

First, we will import our dataset from Kaggle. It is a candy data set, and we are going to see if we can predict whether a candy is chocolate from its other variables. The dataset can be downloaded here: https://www.kaggle.com/fivethirtyeight/the-ultimate-halloween-candy-power-ranking/

inputData <- read.csv("file:///C:/Users/Dakota/Downloads/candy-data.csv")
head(inputData)
##   competitorname chocolate fruity caramel peanutyalmondy nougat
## 1      100 Grand         1      0       1              0      0
## 2   3 Musketeers         1      0       0              0      1
## 3       One dime         0      0       0              0      0
## 4    One quarter         0      0       0              0      0
## 5      Air Heads         0      1       0              0      0
## 6     Almond Joy         1      0       0              1      0
##   crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
## 1                1    0   1        0        0.732        0.860   66.97173
## 2                0    0   1        0        0.604        0.511   67.60294
## 3                0    0   0        0        0.011        0.116   32.26109
## 4                0    0   0        0        0.011        0.511   46.11650
## 5                0    0   0        0        0.906        0.511   52.34146
## 6                0    0   1        0        0.465        0.767   50.34755

Some of the variables, such as caramel and fruity, are categorical: a 0 indicates a non-event and a 1 indicates an event.

Now we will check for class bias: ideally, the proportions of events and non-events in chocolate (the Y variable) should be roughly equal.

table(inputData$chocolate)
## 
##  0  1 
## 48 37

The table shows fairly even proportions, with events (1's) slightly less frequent than non-events, indicating some potential class bias.
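
If you prefer proportions to raw counts (here roughly 56% non-chocolate versus 44% chocolate), a quick way to see them directly:

prop.table(table(inputData$chocolate))  # proportions of non-events (0) and events (1)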

Next, we create training and test samples. We draw equal numbers of 0's and 1's for the training set to remove the class bias.

#Create Training Data

input_ones <- inputData[which(inputData$chocolate == 1), ]  # all 1's
input_zeros <- inputData[which(inputData$chocolate == 0), ]  # all 0's
set.seed(100)  # for repeatability of samples
input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones))  # 1's for training
input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_ones))  # 0's for training. Pick as many 0's as 1's
training_ones <- input_ones[input_ones_training_rows, ]  
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingData <- rbind(training_ones, training_zeros)  # row bind the 1's and 0's 

#Create Test Data
test_ones <- input_ones[-input_ones_training_rows, ]
test_zeros <- input_zeros[-input_zeros_training_rows, ]
testData <- rbind(test_ones, test_zeros)  # row bind the 1's and 0's 
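
As a quick sanity check (not part of the original output), we can confirm the split: the training set should now contain equal numbers of chocolate and non-chocolate candies, and the remaining rows form the test set.

nrow(trainingData)             # number of training rows
table(trainingData$chocolate)  # should be balanced between 0's and 1's
nrow(testData)                 # remaining rows used for testing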

We now need to install and load the smbinning package, which bins our continuous variables into categories and computes an information value (IV) for each predictor, so we can compare how much each variable tells us about the outcome on a common scale.

library(smbinning)
# segregate continuous and factor variables
factor_vars <- c ("competitorname", "fruity", "caramel", "peanutyalmondy", "nougat", "crispedricewafer", "hard", "bar", "pluribus")
continuous_vars <- c("sugarpercent", "pricepercent","winpercent")

iv_df <- data.frame(VARS=c(factor_vars, continuous_vars), IV=numeric(12))  # init for IV results
# compute IV for categorical variables
for(factor_var in factor_vars){
  smb <- smbinning.factor(trainingData, y="chocolate", x=factor_var)  # WOE table
  if(class(smb) != "character"){  # check if an error occurred (smbinning returns a message string on failure)
    iv_df[iv_df$VARS == factor_var, "IV"] <- smb$iv
  }
}

# compute IV for continuous vars
for(continuous_var in continuous_vars){
  smb <- smbinning(trainingData, y="chocolate", x=continuous_var)  # WOE table
  if(class(smb) != "character"){  # check for any error while calculating scores
    iv_df[iv_df$VARS == continuous_var, "IV"] <- smb$iv
  }
}
iv_df <- iv_df[order(-iv_df$IV), ]  # sort
iv_df
##                VARS     IV
## 12       winpercent 1.9178
## 11     pricepercent 1.4218
## 1    competitorname 0.0000
## 2            fruity 0.0000
## 3           caramel 0.0000
## 4    peanutyalmondy 0.0000
## 5            nougat 0.0000
## 6  crispedricewafer 0.0000
## 7              hard 0.0000
## 8               bar 0.0000
## 9          pluribus 0.0000
## 10     sugarpercent 0.0000
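
For intuition, the IV of a predictor is built from the weight of evidence (WOE) of each of its bins: IV = sum over bins of (share of events - share of non-events) * log(share of events / share of non-events). The sketch below computes this naively for a single binary predictor; smbinning adds optimal binning and significance checks on top of this, so its numbers (including the zeros above) will not necessarily match a naive calculation.

# Naive IV for one binary predictor -- illustration only, smbinning does more than this
naive_iv <- function(x, y) {
  tab <- table(x, y)                             # rows: predictor levels, cols: outcome 0/1
  dist_event    <- tab[, "1"] / sum(tab[, "1"])  # share of events (chocolate = 1) per level
  dist_nonevent <- tab[, "0"] / sum(tab[, "0"])  # share of non-events per level
  woe <- log(dist_event / dist_nonevent)         # weight of evidence (infinite if a cell is 0)
  sum((dist_event - dist_nonevent) * woe)        # information value
}
naive_iv(trainingData$fruity, trainingData$chocolate)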

From here we use these results to build our logit model. I chose to use only the variables with an IV above zero: winpercent and pricepercent.

logitMod <- glm(chocolate ~ winpercent + pricepercent, data=trainingData, family=binomial(link="logit"))
predicted <- plogis(predict(logitMod, testData))  # predicted scores
# or
predicted <- predict(logitMod, testData, type="response")  # predicted scores
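
As a quick usage example, we can also score a hypothetical new candy (the winpercent and pricepercent values below are made up purely for illustration):

newCandy <- data.frame(winpercent = 60, pricepercent = 0.5)  # hypothetical candy, made-up values
predict(logitMod, newCandy, type = "response")               # predicted probability of being chocolate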

We are now going to find the optimal cutoff score for the model's predictions. We would expect it to be around 0.5 if the 0's and 1's were in even proportion, but we know that is not the case.

library(InformationValue)
optCutOff <- optimalCutoff(testData$chocolate, predicted)[1] 

optCutOff
## [1] 0.6256588

We will now look at the model summary to examine the coefficients and their significance.

summary(logitMod)
## 
## Call:
## glm(formula = chocolate ~ winpercent + pricepercent, family = binomial(link = "logit"), 
##     data = trainingData)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.58294  -0.74381  -0.03106   0.55504   2.58571  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -7.63949    2.23677  -3.415 0.000637 ***
## winpercent    0.11777    0.03987   2.954 0.003142 ** 
## pricepercent  3.00439    1.37119   2.191 0.028446 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 69.315  on 49  degrees of freedom
## Residual deviance: 43.652  on 47  degrees of freedom
## AIC: 49.652
## 
## Number of Fisher Scoring iterations: 5

We can see that all of our variables are significant.

library(car)
vif(logitMod)
##   winpercent pricepercent 
##      1.00129      1.00129

Looking at the VIF values, we can see that multicollinearity is not a concern: both are well below 4, a common rule-of-thumb threshold.

Next is the misclassification error, the percentage of test observations whose predicted class (at the chosen cutoff) does not match the actual 1 or 0. The lower the better, and we can see that ours is very low!

misClassError(testData$chocolate, predicted, threshold = optCutOff)
## [1] 0.1143

Now we look at the Receiver Operating Characteristic (ROC) curve, which traces the true positive rate against the false positive rate as the cutoff varies. The greater the area under the curve, the better the model discriminates between classes; as you can see, this model has a high area under the curve.

plotROC(testData$chocolate,predicted)
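
If you want the area under the curve as a single number, one option (not used in the original workflow) is the pROC package; a minimal sketch:

library(pROC)                                 # extra package; install.packages("pROC") if needed
rocObj <- roc(testData$chocolate, predicted)  # ROC from actual classes and predicted probabilities
auc(rocObj)                                   # area under the curve; closer to 1 is better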

Moving on to concordance: this is the proportion of all (actual 1, actual 0) pairs in which the model assigns the higher predicted probability to the actual 1. A perfect model would score 100%; ours is about 94.6%, which is pretty high!

Concordance(testData$chocolate,predicted)
## $Concordance
## [1] 0.9456522
## 
## $Discordance
## [1] 0.05434783
## 
## $Tied
## [1] 0.00000000000000002081668
## 
## $Pairs
## [1] 276

Sensitivity measures the number of observations that are both actual 1's and predicted 1's, divided by the total number of actual 1's (the true positive rate).

sensitivity(testData$chocolate,predicted, threshold = optCutOff)
## [1] 0.75

Specificity measures the number of observations that are both actual 0's and predicted 0's, divided by the total number of actual 0's (the true negative rate).

specificity(testData$chocolate,predicted, threshold = optCutOff)
## [1] 0.9565217

Finally, we have the confusion matrix, which tabulates predicted versus actual classes at the chosen cutoff.

confusionMatrix(testData$chocolate,predicted, threshold = optCutOff)
##    0 1
## 0 22 3
## 1  1 9
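
Reading the matrix (columns are the actual classes, rows are the predictions), 22 + 9 = 31 of the 35 test candies are classified correctly. That matches the misclassification error above (4/35 ≈ 0.114), and the 9 of 12 actual chocolates and 22 of 23 actual non-chocolates recovered here reproduce the sensitivity (0.75) and specificity (0.957) values.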

To conclude, after running all these diagnostics on our logistic model, we can see it did a pretty good job, and this was just one example of how to run a regression model when your response variable is categorical!