For predicting whether the agricultural epidemic of powdery mildew in mango will erupt in a certain year in the state of Uttar Pradesh in India, Misra et al. (2004) used annual outbreak 9 A. K. Misra, O. M. Prakash, and V. Ramasubramanian. Forewarning powdery mildew caused by Oidium mangiferae in mango (Mangifera indica) using logistic regression models. Indian Journal of Agricultural Science, 74(2):84-87, 2004 records during 1987-2000. The epidemic typically occurs in the third and fourth week of March, and hence outbreak status is known by the end of March of a given year. The authors used a logistic regression model with two weather predictors (maximum temperature and relative humidity) to forecast an outbreak. The data is shown in the table below and are available in PowderyMildewEpidemic.xls.
The historical data is shown below. The relative humidity and max temp for most years between 1987-2000 are given along with whether there was a mildew outbreak early in the corresponding year.
library(pander)
library(knitr)
library(forecast)
library(caret)
library(dplyr)
setwd("/Users/Chris Iyer/Documents/")
mildew <- read.csv("PowderyMildewEpidemic.csv")
mildew <- mildew %>% mutate(Outbreak1 = ifelse(Outbreak == "Yes", 1,0))
kable(mildew)
Year | Outbreak | MaxTemp | RelHumidity | Outbreak1 |
---|---|---|---|---|
1987 | Yes | 30.14 | 82.86 | 1 |
1988 | No | 30.66 | 79.57 | 0 |
1989 | No | 26.31 | 89.14 | 0 |
1990 | Yes | 28.43 | 91.00 | 1 |
1991 | No | 29.57 | 80.57 | 0 |
1992 | Yes | 31.25 | 67.82 | 1 |
1993 | No | 30.35 | 61.76 | 0 |
1994 | Yes | 30.71 | 81.14 | 1 |
1995 | No | 30.71 | 61.57 | 0 |
1996 | Yes | 33.07 | 59.76 | 1 |
1997 | No | 31.50 | 68.29 | 0 |
2000 | No | 29.50 | 79.14 | 0 |
Is the mildew following a trend?
I plotted the data in a couple of different ways to visualize trends and the relationships between the variables and the mildew’s growth.
In the side by side plots below, I show each predictor variable seperately over time, highlightng the outbreak years with red points.
library(ggplot2)
tempPlot <- ggplot(data=mildew, aes(x=Year, y=MaxTemp)) +
geom_line() +
annotate("point", x = c(1987, 1990, 1992, 1994, 1996), y = c(30.14, 28.43, 31.25, 30.71,33.07),colour = "red", size = 2.5)
humidityPlot <- ggplot(data=mildew, aes(x=Year, y=RelHumidity)) +
geom_line() +
annotate("point", x = c(1987, 1990, 1992, 1994, 1996), y = c(82.86, 91.00, 67.82, 81.14,59.76),colour = "red", size = 2.5)
library(gridExtra)
grid.arrange(tempPlot, humidityPlot, ncol=2)
Since I can’t find any obvious explanation for mildew outbreaks, I tried to plot the two predictor variables on a single plot, standardizing them with their z-scores. They are plotted below; the 5 vertical lines represent years where there was a mildew outbreak.
mildewZ <- mildew %>% mutate(RelHumZ= scale(RelHumidity, center = TRUE, scale = TRUE), MaxTempZ = scale(MaxTemp, center = TRUE, scale = TRUE))
mildewZ
bothPlot <-ggplot(mildewZ, aes(Year)) +
geom_line(aes(y = RelHumZ, color = RelHumZ), size = 1, colour = "red") +
geom_line(aes(y = MaxTempZ, color = MaxTempZ), size = 1, colour = "red") +
geom_vline(xintercept=c(1987, 1990, 1992, 1994, 1996), size = 1.5) +
annotate("text", x = 1998, y = -0.9352666, label = "Relative \nHumidity") +
annotate("text", x = 1998, y = 0.83512904, label = "Temp") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black")) +
scale_x_discrete(name ="Dose (mg)",
limits=c(1987, 1990, 1992, 1994, 1996, 2000))
bothPlot
Answer: In order to generate a forecast using logistic regression, the measurement of the object of interest can either be numerical or binary, however since the forecast is binary, if the object of interest is numerical, the forecaster has to create a derived variable. In the case of predicting whether the powdery mildew will occur in Uttar Pradesh, India, we are given a dataset with the measurements of two predictor variables and a binary yes/no for the outbreak outcome. The outbreak variable has to be converted to a 0,1 for no/yes outcomes. The 0,1 is the derived variable.
Most importantly, the predictor variables must be available at the time of the prediction of the event.
\(log(odds(Mildew)_t=\beta_0 + \beta_{mildew_{t-1}} + \beta_{humidity_{t-1}} + \beta_{maxTemp_{t-1}}\)
plot(mildew$MaxTemp ~ mildew$RelHumidity, xlab = "Relative Humidity", ylab = "Max Temp (degrees Celsius)", col = mildew$Outbreak1 + 1, bty = "l", pch = 15)
legend(60, 29, c("No Outbreak", "Outbreak"), col = 1:2, pch = 15, bty = "l")
This is the same plot with the axes reversed.
plot(mildew$RelHumidity ~ mildew$MaxTemp, ylab = "Relative Humidity", xlab = "Max Temp (degrees Celsius)", col = mildew$Outbreak1 + 1, bty = "l", pch = 15)
legend(26.5,72, c("No Outbreak", "Outbreak"), col = 1:2, pch = 15, bty = "l")
Based on the naive model, the forecast for 2000, is for no mildew outbreak.
naiveForecasts <- mildew$Outbreak1[(length(mildew$Outbreak1)-1-3) : (length(mildew$Outbreak1)-1)]
Year <- c(1995, 1996, 1997, 2000)
MildewYear <- cbind(Year, naiveForecasts)
kable(MildewYear)
Year | naiveForecasts |
---|---|
1995 | 1 |
1996 | 0 |
1997 | 1 |
2000 | 0 |
Confusion Matrix
Based on the matrix below, the accuracy of the naive method is 25%, less reliable than the flip of a coin. This is not the best forecasting model.
confusionMatrix(naiveForecasts, mildew$Outbreak1[(length(mildew$Outbreak1)-3): length(mildew$Outbreak1)], positive = c("1"))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1 1
1 2 0
Accuracy : 0.25
95% CI : (0.0063, 0.8059)
No Information Rate : 0.75
P-Value [Acc > NIR] : 0.9961
Kappa : -0.5
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.0000
Specificity : 0.3333
Pos Pred Value : 0.0000
Neg Pred Value : 0.5000
Prevalence : 0.2500
Detection Rate : 0.0000
Detection Prevalence : 0.5000
Balanced Accuracy : 0.1667
'Positive' Class : 1
trainOutMildew <- mildew[1:8,]
kable(trainOutMildew)
Year | Outbreak | MaxTemp | RelHumidity | Outbreak1 |
---|---|---|---|---|
1987 | Yes | 30.14 | 82.86 | 1 |
1988 | No | 30.66 | 79.57 | 0 |
1989 | No | 26.31 | 89.14 | 0 |
1990 | Yes | 28.43 | 91.00 | 1 |
1991 | No | 29.57 | 80.57 | 0 |
1992 | Yes | 31.25 | 67.82 | 1 |
1993 | No | 30.35 | 61.76 | 0 |
1994 | Yes | 30.71 | 81.14 | 1 |
outLogRegMildew <- glm(Outbreak1 ~ MaxTemp + RelHumidity, data = trainOutMildew, family = "binomial")
summary(outLogRegMildew)
Call:
glm(formula = Outbreak1 ~ MaxTemp + RelHumidity, family = "binomial",
data = trainOutMildew)
Deviance Residuals:
1 2 3 4 5
0.7466 -1.7276 -0.3132 1.0552 -1.1419
6 7 8
1.2419 -0.3908 0.6060
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -56.1543 44.4573 -1.263 0.207
MaxTemp 1.3849 1.1406 1.214 0.225
RelHumidity 0.1877 0.1578 1.189 0.234
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11.0904 on 7 degrees of freedom
Residual deviance: 8.1198 on 5 degrees of freedom
AIC: 14.12
Number of Fisher Scoring iterations: 5
predictionsMildew <- predict(outLogRegMildew, mildew[9:12,],type = "response")
logRegForecast <- cbind(Year, predictionsMildew)
kable(logRegForecast, row.names = FALSE)
Year | predictionsMildew |
---|---|
1995 | 0.1119407 |
1996 | 0.7021411 |
1997 | 0.5705413 |
2000 | 0.3894790 |
Cutoff value = 0.5
confusionMatrix(ifelse(predictionsMildew > 0.5, 1,0), mildew[9:12,]$Outbreak1, positive = c("1"))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2 0
1 1 1
Accuracy : 0.75
95% CI : (0.1941, 0.9937)
No Information Rate : 0.75
P-Value [Acc > NIR] : 0.7383
Kappa : 0.5
Mcnemar's Test P-Value : 1.0000
Sensitivity : 1.0000
Specificity : 0.6667
Pos Pred Value : 0.5000
Neg Pred Value : 1.0000
Prevalence : 0.2500
Detection Rate : 0.2500
Detection Prevalence : 0.5000
Balanced Accuracy : 0.8333
'Positive' Class : 1
The probability of a mildew outbreak in 1995 is 11.2% using a cutoff of 0.5. The accuracty of the model is 75%, much better than that of the naive model. Forecasting the probability that the mildew will grow with a logistic regression model is preferible than forecasting with a naive model in this case given the data set.