MBA Assignment 7

Question 1

As with other time-series data, the predictors’ data must be available prior to time time of prediction in order to be used to create a future forecast.

Question 2

The logistic regression equation would look like the statement below, using max temperature and relative humidity as predictors:

log(odds) = B0 + B1(max temperature) + B2(relative humidity)

Question 3

library(forecast)
library(knitr)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

setwd ("~/Documents/MBA 678/Unit 6")
mildew <-read.csv("Mildew.csv", stringsAsFactors = FALSE, header = TRUE)
#Now that we have some basics out of the way, let's create the scatter plots for Q3:
plot(mildew$MaxTemp ~ mildew$RelHumidity, xlab="Relative Humidity", ylab="Maximum Temperature", bty="l", col=mildew$OutbreakBinary+3, pch=15)

# Add a legend
legend(60, 29, c("No Outbreak", "Outbreak"), col=3:4, pch=15)

There does appear to be a relationship between the predictors and the outcome of an outbreak / no outbreak. The relationship that is most apparent has to do with the outbreak result and the relative humidity: Three of the four highest humidity observations had an outbreak. Though somewhat weaker, we also notice that the blue “outbreak” dots tend to appear in the upper half of the temperature range, with the combination of high heat and high humidity (upper right quadrant in the plot above) holding a cluster of outbreak observations.

Question 4

#Compute the roll-forward naive forecasts.  Treat 1987 - 1994 as the initial training period.  Naive roll-forward for 1995 would match the value of 1994.  Naive roll-forward for 1996 would match the actual value of 1995 and so on. Let's attempt Prof. Dean's shortcut method to create our naive forecast.  
# Pull out 1994-1997
naive_mildew <- mildew$OutbreakBinary[(length(mildew$OutbreakBinary)-1-3):(length(mildew$OutbreakBinary)-1)]

#Sanity check: Expected results are 1,0,1,0.
naive_mildew

## [1] 1 0 1 0

#Sanity check passed.  Let's create a nice table:

Year	Forecast
1995	Outbreak
1996	No Outbreak
1997	Outbreak
2000	No Outbreak

Based on this work, the naive roll-forward forecast for 2000 is No Outbreak.

Next, let’s examine the Confusion / Classification Matrix for this naive roll-forward forecast:

confusionMatrix(naive_mildew, mildew$OutbreakBinary[(length(mildew$OutbreakBinary)-3):length(mildew$OutbreakBinary)], positive=c("1"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 1 1
##          1 2 0
##                                           
##                Accuracy : 0.25            
##                  95% CI : (0.0063, 0.8059)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.9961          
##                                           
##                   Kappa : -0.5            
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.0000          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.1667          
##                                           
##        'Positive' Class : 1               
##

Based on this exercise, our naive forecasts aren’t terribly strong, with only 25% accuracy.

Question 5

#Partition into training and validation periods, where 1987 - 1994 are the training periods and 1995, 1996, 1997, and 2000 is our validation period.  
train_mildew <- mildew[1:8, ]
#Sanity check
train_mildew

##   Year Outbreak MaxTemp RelHumidity OutbreakBinary
## 1 1987      Yes   30.14       82.86              1
## 2 1988       No   30.66       79.57              0
## 3 1989       No   26.31       89.14              0
## 4 1990      Yes   28.43       91.00              1
## 5 1991       No   29.57       80.57              0
## 6 1992      Yes   31.25       67.82              1
## 7 1993       No   30.35       61.76              0
## 8 1994      Yes   30.71       81.14              1

# Now we create the model
LogReg_mildew <- glm(OutbreakBinary ~ MaxTemp + RelHumidity, data=train_mildew, family="binomial")
# Look at the summary
summary(LogReg_mildew)

## 
## Call:
## glm(formula = OutbreakBinary ~ MaxTemp + RelHumidity, family = "binomial", 
##     data = train_mildew)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.7466  -1.7276  -0.3132   1.0552  -1.1419   1.2419  -0.3908   0.6060  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543    44.4573  -1.263    0.207
## MaxTemp       1.3849     1.1406   1.214    0.225
## RelHumidity   0.1877     0.1578   1.189    0.234
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11.0904  on 7  degrees of freedom
## Residual deviance:  8.1198  on 5  degrees of freedom
## AIC: 14.12
## 
## Number of Fisher Scoring iterations: 5

#Now let's see the probability of having an outbreak on the validation period
mildew_predictions <- predict(LogReg_mildew, mildew[9:12,], type="response")
# See what we predicted 
# This is the probability of having a mildew outbreak
mildew_predictions

##         9        10        11        12 
## 0.1119407 0.7021411 0.5705413 0.3894790

#So we can interpret these forecasts, using a .5 cutoff, as:

Year	Forecast	Actual
1995	No Outbreak	No Outbreak
1996	Outbreak	Outbreak
1997	Outbreak	No Outbreak
2000	No Outbreak	No Outbreak

So it looks like we’ve done much better with the Logistic regression, improving our accuracy to 75% with a .5 cutoff. Let’s build the confusion matrix to see what that says:

# Generate the confusion matrix
confusionMatrix(ifelse(mildew_predictions > 0.5, 1, 0), mildew[9:12,]$OutbreakBinary, positive=c("1"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 2 0
##          1 1 1
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.1941, 0.9937)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.7383          
##                                           
##                   Kappa : 0.5             
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : 1               
##

Not too shabby!

As an analyst, I might report the forecasted probabilities rather than the cutoff values. The probabilities are richer in meaning than the binaries, and could add context around a business decision.