Chapter 8 Problems: Predicting whether the agricultural epidemic of powdery mildew in mango will erupt in a certain year

# Packages used below: caret for confusionMatrix(), ggplot2 for the second plot,
# and (assumed) gtools for logit()
library(caret)
library(ggplot2)
library(gtools)
Mildew <- read.csv("PowderyMildewEpidemic.csv", stringsAsFactors = FALSE)
str(Mildew)
## 'data.frame':    12 obs. of  4 variables:
##  $ Year       : int  1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 ...
##  $ Outbreak   : chr  "Yes" "No" "No" "Yes" ...
##  $ MaxTemp    : num  30.1 30.7 26.3 28.4 29.6 ...
##  $ RelHumidity: num  82.9 79.6 89.1 91 80.6 ...
  1. In order for the model to serve as a forewarning system for farmers, what requirements must be satisfied regarding data availability?

For the model to serve as a forewarning system for farmers, the predictor data must actually be available at the time the prediction is made. If the prediction is made at time t, the forecasting model should use lagged versions of external information, or forecasts of external information that is not yet observed at time t. Availability refers not only to the existence of past and future data, but also to delays in acquiring external data. In this example, outbreak status is known by the end of March, even though outbreaks may occur in the third or fourth week of the month. The binary outcome of interest should therefore be related to measurements from previous periods (a sketch of constructing such lagged predictors follows below). Enough data must also be available to compare past forecasts with actual observations and quantify the expected error, and the external predictors should be meaningfully correlated with the series of interest (ideally statistically significant).
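
As an illustration of the lagging idea, here is a minimal sketch (assuming the Mildew data frame loaded above; MildewLagged is a new name introduced here) that builds one-row-lagged versions of the two weather predictors, so that a forecast for year t relies only on weather already observed in the previous recorded year. Note that the gap between 1997 and 2000 means a row lag is not always a one-calendar-year lag.

# Sketch: lag the weather predictors by one row so a forecast for year t
# uses only weather observed in the previous recorded year
MildewLagged <- transform(Mildew,
                          MaxTemp.lag1     = c(NA, head(MaxTemp, -1)),
                          RelHumidity.lag1 = c(NA, head(RelHumidity, -1)))
head(MildewLagged)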

  2. Write an equation for the model fitted by the researchers in the form of equation (8.1). Use predictor names instead of x notation.

\[\log(odds(Outbreak_t)) = \beta_0 + \beta_1(MaxTemp_t) + \beta_2(RelHumidity_t)\]

  3. Create a scatter plot of the two predictors, using a different hue for epidemic and non-epidemic markers. Does there appear to be a relationship between epidemic status and the two predictors?
# Convert character Outbreak data to binary values
Mildew$Outbreak <- ifelse(Mildew$Outbreak == "Yes", 1, 0)

# Create scatterplot of MaxTemp and RelHumidity
par(oma = c(0,0,0,3))
plot(Mildew$RelHumidity ~ Mildew$MaxTemp,
     xlim = c(25, 35), ylim = c(58, 92),
     xlab = expression("Maximum Temperature " ~ degree ~ C), ylab = "Relative Humidity (%)",
     bty = "l", col = Mildew$Outbreak + 1, pch = 18)

# Add a legend
legend(32, 92, c("Non-Epidemic", "Epidemic"), col=1:2, pch=18, cex = 0.7)

# Another way of looking at the data (using ggplot & switching the axes)
ggplot(Mildew, aes(x = RelHumidity, y = MaxTemp, colour = factor(Outbreak))) +
  geom_point() +
  ylab(expression("Maximum Temperature " ( degree*C))) +
  xlab("Relative Humidity (%)") +
  theme_bw() +
  theme(legend.key = element_blank())

There appears to be a roughly linear, negatively sloped pattern among the epidemic years: as maximum temperature increases, relative humidity decreases. What stands out most, though, is that the epidemic markers tend to occur at higher temperature and/or higher humidity than the non-epidemic ones, so there does appear to be a relationship between epidemic status and the two predictors.

  4. Compute naive forecasts of epidemic status for years 1995-1997 using next-year forecasts (\(F_{t+1} = F_t\)). What is the naive forecast for the year 2000? Summarize the results for these four years in a classification matrix.
# Naive forecasts: the outbreak status observed in the previous recorded year
# (rows 8-11 serve as forecasts for rows 9-12, i.e. 1995-1997 and 2000)
naiveForecasts <- Mildew$Outbreak[(length(Mildew$Outbreak)-1-3):(length(Mildew$Outbreak)-1)]
naiveForecasts
## [1] 1 0 1 0

The naive forecasts for 1995 and 1997 are both "outbreak", while the forecasts for 1996 and 2000 are both "no outbreak"; the short sketch below lays out the year-by-year alignment explicitly.
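
A small sketch (assuming, as in the question, that the last four rows of the data are 1995-1997 and 2000) that pairs each forecast year with the prior recorded year's status used as its naive forecast and with the actual outcome:

# Pair each validation year with its naive forecast (prior recorded year's status)
n <- nrow(Mildew)
data.frame(ForecastYear  = Mildew$Year[(n-3):n],
           NaiveForecast = Mildew$Outbreak[(n-4):(n-1)],
           Actual        = Mildew$Outbreak[(n-3):n])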

# confusionMatrix() (from the caret package) expects factor inputs with matching levels
confusionMatrix(factor(naiveForecasts, levels = c(0, 1)),
                factor(Mildew$Outbreak[(length(Mildew$Outbreak)-3):length(Mildew$Outbreak)], levels = c(0, 1)),
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 1 1
##          1 2 0
##                                           
##                Accuracy : 0.25            
##                  95% CI : (0.0063, 0.8059)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.9961          
##                                           
##                   Kappa : -0.5            
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.0000          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.1667          
##                                           
##        'Positive' Class : 1               
## 

The classification matrix shows that the roll-forward naive forecast got only 1 of the 4 years right, for an accuracy of 25%. Both the positive and negative predictive values are also poor (0% and 50%, respectively); the sketch below recomputes these headline metrics directly from the 2 x 2 counts.
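
A short sketch that recomputes accuracy, sensitivity, and specificity by hand from the same vectors passed to confusionMatrix() above (actual4 is a helper name introduced here):

# Recompute the headline metrics by hand from the naive-forecast comparison
actual4 <- Mildew$Outbreak[(length(Mildew$Outbreak)-3):length(Mildew$Outbreak)]
table(Predicted = naiveForecasts, Actual = actual4)
mean(naiveForecasts == actual4)                               # accuracy = 0.25
sum(naiveForecasts == 1 & actual4 == 1) / sum(actual4 == 1)   # sensitivity = 0
sum(naiveForecasts == 0 & actual4 == 0) / sum(actual4 == 0)   # specificity = 1/3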

  5. Partition the data into training and validation periods, so that years 1987-1994 are the training period. Fit a logistic regression to the training period using the two predictors, and report the outbreak probability as well as a forecast for year 1995 (use a threshold of 0.5).
# Partition the data into training and validation periods as described above
train <- Mildew[1:8, ]

# Checking to make sure it did what it should
train
##   Year Outbreak MaxTemp RelHumidity
## 1 1987        1   30.14       82.86
## 2 1988        0   30.66       79.57
## 3 1989        0   26.31       89.14
## 4 1990        1   28.43       91.00
## 5 1991        0   29.57       80.57
## 6 1992        1   31.25       67.82
## 7 1993        0   30.35       61.76
## 8 1994        1   30.71       81.14
# Create model by fitting the logistic regression to training period
LogReg <- glm(Outbreak ~ MaxTemp + RelHumidity, data=train, family="binomial")

# Look at the summary
summary(LogReg)
## 
## Call:
## glm(formula = Outbreak ~ MaxTemp + RelHumidity, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.7466  -1.7276  -0.3132   1.0552  -1.1419   1.2419  -0.3908   0.6060  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543    44.4573  -1.263    0.207
## MaxTemp       1.3849     1.1406   1.214    0.225
## RelHumidity   0.1877     0.1578   1.189    0.234
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11.0904  on 7  degrees of freedom
## Residual deviance:  8.1198  on 5  degrees of freedom
## AIC: 14.12
## 
## Number of Fisher Scoring iterations: 5
# Report outbreak probabilities for the validation period
predictions <- predict(LogReg, Mildew[9:12, ], type = "response")

# Predicted probability of an outbreak in each validation year (1995-1997 and 2000)
predictions
##         9        10        11        12 
## 0.1119407 0.7021411 0.5705413 0.3894790

The predicted probability of an outbreak in 1995 is 11.2%. Since this is below the 0.5 threshold, the forecast for 1995 is no outbreak. The sketch below rebuilds this probability directly from the fitted coefficients.
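
As a check, a minimal sketch (eta1995 is a helper name introduced here) that computes the linear predictor for the 1995 row from the fitted coefficients and passes it through the inverse logit, plogis():

# Rebuild the 1995 probability by hand: linear predictor, then inverse logit
eta1995 <- sum(coef(LogReg) * c(1, Mildew$MaxTemp[9], Mildew$RelHumidity[9]))
plogis(eta1995)   # about 0.112, matching predict(..., type = "response") above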

# Generate the classification matrix for the validation period using a cutoff of 0.5
confusionMatrix(factor(ifelse(predictions > 0.5, 1, 0), levels = c(0, 1)),
                factor(Mildew[9:12, ]$Outbreak, levels = c(0, 1)), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 2 0
##          1 1 1
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.1941, 0.9937)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.7383          
##                                           
##                   Kappa : 0.5             
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : 1               
## 
# Confusion matrix for the fitted values of the training period
confusionMatrix(factor(ifelse(LogReg$fitted > 0.5, 1, 0), levels = c(0, 1)),
                factor(train$Outbreak, levels = c(0, 1)), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 3 1
##          1 1 3
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.3491, 0.9681)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.1445          
##                                           
##                   Kappa : 0.5             
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.750           
##             Specificity : 0.750           
##          Pos Pred Value : 0.750           
##          Neg Pred Value : 0.750           
##              Prevalence : 0.500           
##          Detection Rate : 0.375           
##    Detection Prevalence : 0.500           
##       Balanced Accuracy : 0.750           
##                                           
##        'Positive' Class : 1               
## 
# Logit (log-odds) transformation of the predicted probabilities;
# logit() is assumed to come from gtools (base R's qlogis() is equivalent)
logit(predictions)
##          9         10         11         12 
## -2.0710695  0.8575143  0.2840602 -0.4495028
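
These values are simply the log-odds of the predicted probabilities; a quick sketch confirming the identity and its inverse using base R only:

# logit is log(p / (1 - p)); plogis() inverts it and recovers the probabilities
log(predictions / (1 - predictions))
plogis(log(predictions / (1 - predictions)))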