As with other time-series data, the predictors’ data must be available prior to time time of prediction in order to be used to create a future forecast.
The logistic regression equation would look like the statement below, using max temperature and relative humidity as predictors:
log(odds) = B0 + B1(max temperature) + B2(relative humidity)
library(forecast)
library(knitr)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
setwd ("~/Documents/MBA 678/Unit 6")
mildew <-read.csv("Mildew.csv", stringsAsFactors = FALSE, header = TRUE)
#Now that we have some basics out of the way, let's create the scatter plots for Q3:
plot(mildew$MaxTemp ~ mildew$RelHumidity, xlab="Relative Humidity", ylab="Maximum Temperature", bty="l", col=mildew$OutbreakBinary+3, pch=15)
# Add a legend
legend(60, 29, c("No Outbreak", "Outbreak"), col=3:4, pch=15)
There does appear to be a relationship between the predictors and the outcome of an outbreak / no outbreak. The relationship that is most apparent has to do with the outbreak result and the relative humidity: Three of the four highest humidity observations had an outbreak. Though somewhat weaker, we also notice that the blue “outbreak” dots tend to appear in the upper half of the temperature range, with the combination of high heat and high humidity (upper right quadrant in the plot above) holding a cluster of outbreak observations.
#Compute the roll-forward naive forecasts. Treat 1987 - 1994 as the initial training period. Naive roll-forward for 1995 would match the value of 1994. Naive roll-forward for 1996 would match the actual value of 1995 and so on. Let's attempt Prof. Dean's shortcut method to create our naive forecast.
# Pull out 1994-1997
naive_mildew <- mildew$OutbreakBinary[(length(mildew$OutbreakBinary)-1-3):(length(mildew$OutbreakBinary)-1)]
#Sanity check: Expected results are 1,0,1,0.
naive_mildew
## [1] 1 0 1 0
#Sanity check passed. Let's create a nice table:
| Year | Forecast |
|---|---|
| 1995 | Outbreak |
| 1996 | No Outbreak |
| 1997 | Outbreak |
| 2000 | No Outbreak |
Based on this work, the naive roll-forward forecast for 2000 is No Outbreak.
Next, let’s examine the Confusion / Classification Matrix for this naive roll-forward forecast:
confusionMatrix(naive_mildew, mildew$OutbreakBinary[(length(mildew$OutbreakBinary)-3):length(mildew$OutbreakBinary)], positive=c("1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1 1
## 1 2 0
##
## Accuracy : 0.25
## 95% CI : (0.0063, 0.8059)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.9961
##
## Kappa : -0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.0000
## Specificity : 0.3333
## Pos Pred Value : 0.0000
## Neg Pred Value : 0.5000
## Prevalence : 0.2500
## Detection Rate : 0.0000
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.1667
##
## 'Positive' Class : 1
##
Based on this exercise, our naive forecasts aren’t terribly strong, with only 25% accuracy.
#Partition into training and validation periods, where 1987 - 1994 are the training periods and 1995, 1996, 1997, and 2000 is our validation period.
train_mildew <- mildew[1:8, ]
#Sanity check
train_mildew
## Year Outbreak MaxTemp RelHumidity OutbreakBinary
## 1 1987 Yes 30.14 82.86 1
## 2 1988 No 30.66 79.57 0
## 3 1989 No 26.31 89.14 0
## 4 1990 Yes 28.43 91.00 1
## 5 1991 No 29.57 80.57 0
## 6 1992 Yes 31.25 67.82 1
## 7 1993 No 30.35 61.76 0
## 8 1994 Yes 30.71 81.14 1
# Now we create the model
LogReg_mildew <- glm(OutbreakBinary ~ MaxTemp + RelHumidity, data=train_mildew, family="binomial")
# Look at the summary
summary(LogReg_mildew)
##
## Call:
## glm(formula = OutbreakBinary ~ MaxTemp + RelHumidity, family = "binomial",
## data = train_mildew)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## 0.7466 -1.7276 -0.3132 1.0552 -1.1419 1.2419 -0.3908 0.6060
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543 44.4573 -1.263 0.207
## MaxTemp 1.3849 1.1406 1.214 0.225
## RelHumidity 0.1877 0.1578 1.189 0.234
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11.0904 on 7 degrees of freedom
## Residual deviance: 8.1198 on 5 degrees of freedom
## AIC: 14.12
##
## Number of Fisher Scoring iterations: 5
#Now let's see the probability of having an outbreak on the validation period
mildew_predictions <- predict(LogReg_mildew, mildew[9:12,], type="response")
# See what we predicted
# This is the probability of having a mildew outbreak
mildew_predictions
## 9 10 11 12
## 0.1119407 0.7021411 0.5705413 0.3894790
#So we can interpret these forecasts, using a .5 cutoff, as:
| Year | Forecast | Actual |
|---|---|---|
| 1995 | No Outbreak | No Outbreak |
| 1996 | Outbreak | Outbreak |
| 1997 | Outbreak | No Outbreak |
| 2000 | No Outbreak | No Outbreak |
So it looks like we’ve done much better with the Logistic regression, improving our accuracy to 75% with a .5 cutoff. Let’s build the confusion matrix to see what that says:
# Generate the confusion matrix
confusionMatrix(ifelse(mildew_predictions > 0.5, 1, 0), mildew[9:12,]$OutbreakBinary, positive=c("1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 0
## 1 1 1
##
## Accuracy : 0.75
## 95% CI : (0.1941, 0.9937)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.7383
##
## Kappa : 0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 1.0000
## Specificity : 0.6667
## Pos Pred Value : 0.5000
## Neg Pred Value : 1.0000
## Prevalence : 0.2500
## Detection Rate : 0.2500
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8333
##
## 'Positive' Class : 1
##
Not too shabby!
As an analyst, I might report the forecasted probabilities rather than the cutoff values. The probabilities are richer in meaning than the binaries, and could add context around a business decision.