In chapter 6 of the book, Data Smart [http://www.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html], by John Foreman, the synthesized challenge is to predict which of a retailers’s customers are pregnant based on a dataset of their shopping records.
A logistic regression model is used. The model is trained on the shopping records of 500 pregnant customers and 500 non-pregnant customers. The model is then tested on a dataset of 1000 different customers, each of whose pregnancy status is known.
Data Smart develops a logistic regression solution using Excel. We are free to user R to solve the problem and we do so using the code here. The dataset is available at the book’s website.
# load the required libraries
library(ROCR) # used to create the ROC curve
library(caret) # used to generate a confusion matrix
# A function to classify pregnancy outcomes based on probabilities of being pregnant.The arguments are: the threshold probability for a positive pregnancy prediction, the predicted pregnancy probabilities of each customer and the known pregnancy status of the customers
manualClass <- function(linThreshold, predictedPreg, knownResult){
pregPredClass <- predictedPreg # the predicted probability of pregnancy
pregPredClass[pregPredClass >= linThreshold] <- 1 # classifies as 1 if above threshold
pregPredClass[pregPredClass < linThreshold] <- 0 # classifies as zero if below threshold
linPregClass <- cbind(pregPredClass, knownResult) # stores predicted classification & known result
}
# Function to calculate and plot the Confusion Matrix parameters for the assigned linThreshold value
confusMatrix <- function(pregPredClass, knownResult){
pregPredClass <- as.factor(pregPredClass) # confusionMatrix() requires factor variables
levels(pregPredClass) <- c("notPregnant", "Pregnant") # assigns level labels
testResult <- as.factor(knownResult) # confusionMatrix() requires factor variables
levels(testResult) <- c("notPregnant", "Pregnant")
confusionMat <- confusionMatrix(pregPredClass, testResult, positive = "Pregnant")
}
# Read in the training data and test set data
trainData <- read.csv('./retailmart.csv')
testData <- read.csv('./testSet.csv')
# Isolate the test set result column (#20) from the predictor variables
testResult <- testData[, 20]
testNoResult <- testData[, -20]
# Train a logistic regression model
pTrainLogit <- glm(PREGNANT ~., data = trainData, family = binomial)
# Predict pregnancies for the test set based on the trained model
newdata <- testData
pregPredLogit <- predict.glm(pTrainLogit, newdata, type = "response") # probability values of pregnancy
# Manually classify the logistic regression predictions. The threshold is set deliberately high because false positives are particularly undesirable
pregPredClass <- manualClass(0.96, pregPredLogit, testResult)
# Generate the Confusion Matrix for the prediction results for the threshold value of 0.96
confusionMat <- confusMatrix(pregPredClass[,1], testResult)
print(confusionMat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction notPregnant Pregnant
## notPregnant 936 37
## Pregnant 4 23
##
## Accuracy : 0.959
## 95% CI : (0.9448, 0.9704)
## No Information Rate : 0.94
## P-Value [Acc > NIR] : 0.004943
##
## Kappa : 0.5105
## Mcnemar's Test P-Value : 5.806e-07
##
## Sensitivity : 0.3833
## Specificity : 0.9957
## Pos Pred Value : 0.8519
## Neg Pred Value : 0.9620
## Prevalence : 0.0600
## Detection Rate : 0.0230
## Detection Prevalence : 0.0270
## Balanced Accuracy : 0.6895
##
## 'Positive' Class : Pregnant
##
pregPredClassAlt <- manualClass(0.5, pregPredLogit, testResult)
confusionMatAlt <- confusMatrix(pregPredClass[,1], testResult)
print(confusionMatAlt)
## Confusion Matrix and Statistics
##
## Reference
## Prediction notPregnant Pregnant
## notPregnant 936 37
## Pregnant 4 23
##
## Accuracy : 0.959
## 95% CI : (0.9448, 0.9704)
## No Information Rate : 0.94
## P-Value [Acc > NIR] : 0.004943
##
## Kappa : 0.5105
## Mcnemar's Test P-Value : 5.806e-07
##
## Sensitivity : 0.3833
## Specificity : 0.9957
## Pos Pred Value : 0.8519
## Neg Pred Value : 0.9620
## Prevalence : 0.0600
## Detection Rate : 0.0230
## Detection Prevalence : 0.0270
## Balanced Accuracy : 0.6895
##
## 'Positive' Class : Pregnant
##
# Plot ROC curve using RORC package
predLogit <- prediction(pregPredLogit, testResult)
perfLogit <- performance(predLogit, "tpr", "fpr")
plot(perfLogit)
The results are consistent with those generated manually by the author in Excel in chapter 6 of his book.