Chapter 8 Problems: Predicting whether the agricultural epidemic of powdery mildew in mango will erupt in a certain year
Mildew <- read.csv("PowderyMildewEpidemic.csv", stringsAsFactors = FALSE)
str(Mildew)
## 'data.frame': 12 obs. of 4 variables:
## $ Year : int 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 ...
## $ Outbreak : chr "Yes" "No" "No" "Yes" ...
## $ MaxTemp : num 30.1 30.7 26.3 28.4 29.6 ...
## $ RelHumidity: num 82.9 79.6 89.1 91 80.6 ...
In order for the model to serve as a forewarning system for farmers the data must be available from the same source at the time of prediction. If the time of prediction is t, then the forecasting model should include lagged versions of external information or forecasts of external information that is not available at the time of prediction. Availability refers not only to past and future data, but also delays in acquiring external data. In this example outbreak status is known by the end of March, even though outbreaks may occur in the third or fourth week. The binary outcome of interest should be related to measurements in previous periods. Enough data must be available to compare past forecasts to observations to quantify expected errors. External predictors should correlate to the series of interest and be statistically significant.
\[\log(odds(Outbreak_t)) = \beta_0 + \beta_1(MaxTemp_t) + \beta_2(RelHumidity_t)\]
# Convert character Outbreak data to binary values
Mildew$Outbreak <- ifelse(Mildew$Outbreak == "Yes", 1, 0)
# Create scatterplot of MaxTemp and RelHumidity
par(oma = c(0,0,0,3))
plot(Mildew$RelHumidity ~ Mildew$MaxTemp, xlim = c(25,35), ylim = c(58,92), ylab = "Relative Humidity (%)", xlab = ("Maximum Temperature " ~degree~C), bty = "l", col = Mildew$Outbreak + 1, pch = 18)
# Add a legend
legend(32, 92, c("Non-Epidemic", "Epidemic"), col=1:2, pch=18, cex = 0.7)
# Another way of looking at the data (using ggplot & switching the axes)
ggplot(Mildew, aes(x = RelHumidity, y = MaxTemp, colour = Outbreak)) + geom_point(aes(colour = factor(Outbreak))) + ylab(expression("Maximum Temperature " ( degree*C))) + xlab('Relative Humidity (%)') + theme_bw() + theme(legend.key = element_blank())
It looks like there is a linear trend for the epidemic marker with a negative slope, as the temperature increases the humidity decreases. One thing that stands out though is that the epidemic markers occur at higher temperature and/or humidity than the non-epidemic ones.
naiveForecasts <- Mildew$Outbreak[(length(Mildew$Outbreak)-1-3):(length(Mildew$Outbreak)-1)]
naiveForecasts
## [1] 1 0 1 0
The naive forecasts for 1995 & 1997 are both Outbreak, while 1996 and 200 are both no outbreak.
confusionMatrix(naiveForecasts, Mildew$Outbreak[(length(Mildew$Outbreak)-3):length(Mildew$Outbreak)], positive=c("1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1 1
## 1 2 0
##
## Accuracy : 0.25
## 95% CI : (0.0063, 0.8059)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.9961
##
## Kappa : -0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.0000
## Specificity : 0.3333
## Pos Pred Value : 0.0000
## Neg Pred Value : 0.5000
## Prevalence : 0.2500
## Detection Rate : 0.0000
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.1667
##
## 'Positive' Class : 1
##
The classification matrix shows that the roll-forward naive forecast only predicted 1 out of 4 things correctly, giving an accuracy of 25%. We can also see that both the positive and negative prediction values are not very good.
#Partition the data into training and validation periods as described above
train <- Mildew[1:8, ]
# Checking to make sure it did what it should
train
## Year Outbreak MaxTemp RelHumidity
## 1 1987 1 30.14 82.86
## 2 1988 0 30.66 79.57
## 3 1989 0 26.31 89.14
## 4 1990 1 28.43 91.00
## 5 1991 0 29.57 80.57
## 6 1992 1 31.25 67.82
## 7 1993 0 30.35 61.76
## 8 1994 1 30.71 81.14
# Create model by fitting the logistic regression to training period
LogReg <- glm(Outbreak ~ MaxTemp + RelHumidity, data=train, family="binomial")
# Look at the summary
summary(LogReg)
##
## Call:
## glm(formula = Outbreak ~ MaxTemp + RelHumidity, family = "binomial",
## data = train)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## 0.7466 -1.7276 -0.3132 1.0552 -1.1419 1.2419 -0.3908 0.6060
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543 44.4573 -1.263 0.207
## MaxTemp 1.3849 1.1406 1.214 0.225
## RelHumidity 0.1877 0.1578 1.189 0.234
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11.0904 on 7 degrees of freedom
## Residual deviance: 8.1198 on 5 degrees of freedom
## AIC: 14.12
##
## Number of Fisher Scoring iterations: 5
# Report outbreak probability
predictions <- predict(LogReg,Mildew[9:12,],type="response")
# See what the predicted probability of being above average number of outbreaks
predictions
## 9 10 11 12
## 0.1119407 0.7021411 0.5705413 0.3894790
The probability of an outbreak in 1995 is 11.2%.
# Generate the classification matrix using a cutoff of 0.5
confusionMatrix(ifelse(predictions > 0.5, 1, 0), Mildew[9:12,]$Outbreak, positive=c("1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 0
## 1 1 1
##
## Accuracy : 0.75
## 95% CI : (0.1941, 0.9937)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.7383
##
## Kappa : 0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 1.0000
## Specificity : 0.6667
## Pos Pred Value : 0.5000
## Neg Pred Value : 1.0000
## Prevalence : 0.2500
## Detection Rate : 0.2500
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8333
##
## 'Positive' Class : 1
##
# Confusion Matrix for fitted values of training period
confusionMatrix(ifelse(LogReg$fitted > 0.5, 1, 0), train$Outbreak, positive=c("1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3 1
## 1 1 3
##
## Accuracy : 0.75
## 95% CI : (0.3491, 0.9681)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.1445
##
## Kappa : 0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.750
## Specificity : 0.750
## Pos Pred Value : 0.750
## Neg Pred Value : 0.750
## Prevalence : 0.500
## Detection Rate : 0.375
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.750
##
## 'Positive' Class : 1
##
# Logit transformation.
logit(predictions)
## 9 10 11 12
## -2.0710695 0.8575143 0.2840602 -0.4495028