#libraries
library(forecast)
library(caret)
library(dplyr)
library(knitr)
#load file
Mildew <- read.csv("PowderyMildewEpidemic.csv", stringsAsFactors = FALSE)
In order for the model to serve as a forewarning system for farmers, what requirements must be satisfied regarding data availability?
All predictors (such as temperature and humidity) must be available at the time of prediction, and the variable of Outbreak, as to whether or not an event occured.
Write an equation for the model fitted by the researchers in the form of equation(8.1). Use predictor names instead of x notation.
\[log(odds(Outbreak)) = \beta_0 + \beta_1(MaxTemp) + \beta_2(RelHumidity)\]
Create a scatter plot of the two predictors, using different hue for epidemic and non-epidemic markers. Does there appear to be a relationship between epidemic status and the two predictors?
#convert Outbreak to binary
Mildew <- Mildew %>% mutate(Outbreak.Yes = ifelse(Outbreak == "Yes", 1, 0))
# Plot Outbreak using different colors with MaxTemp on x-axis and Humidity on y-axis
# Make the points solid squares (pch=15) just to see them a bit better
plot(Mildew$MaxTemp ~ Mildew$RelHumidity, xlab = "Relative Humidity", ylab = "Maximum Temperature", bty="l", col=Mildew$Outbreak.Yes+1, pch=15)
# Add a legend
legend(65, 28, c("No Outbreak", "Outbreak"), col=1:2, pch=15)
#Flip axes
plot(Mildew$RelHumidity ~ Mildew$MaxTemp, xlab = "Maximum Temperature", ylab = "Relative Humidity", bty="l", col=Mildew$Outbreak.Yes+1, pch=15)
# Add a legend
legend(27, 70, c("No Outbreak", "Outbreak"), col=1:2, pch=15)
Looking at either of the plots produced above, any relationship seems weak. That said, you can see that outbreaks occured when the temperature and the humidity were farther away from the origin of the chart, when compared to conditions where no outbreak was present. ie. you could draw a negative linear trend line on the slope of the data and find that outbreaks occurred largely on the upper-right side of the trendline, and fewer outbreaks occurred to the lower-left of the trend line.
Compute naive forecasts of epidemic status for years 1995-1997 using next-year forecasts(Ft+1 = Ft). What is the naive forecast for year 2000? Summarize the results for these four years in a classification matrix.
# Pull out 1995-1997
naive.Mildew <- Mildew$Outbreak.Yes[(length(Mildew$Outbreak.Yes)-1-3):(length(Mildew$Outbreak.Yes)-1)]
# See what they look like (Sanity check here - s/b: 1,0,1,0)
naive.Mildew
## [1] 1 0 1 0
# Confusion Matrix
confusionMatrix(naive.Mildew, Mildew$Outbreak.Yes[(length(Mildew$Outbreak.Yes)-3): length(Mildew$Outbreak.Yes)], positive = c("1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1 1
## 1 2 0
##
## Accuracy : 0.25
## 95% CI : (0.0063, 0.8059)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.9961
##
## Kappa : -0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.0000
## Specificity : 0.3333
## Pos Pred Value : 0.0000
## Neg Pred Value : 0.5000
## Prevalence : 0.2500
## Detection Rate : 0.0000
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.1667
##
## 'Positive' Class : 1
##
The accuracy of the above model is 25%. Since that may as well be as good (or worse) than guessing, I surmise that a naive forecast might not be the best solution for predicting outbreaks.
Partition the data into training and validation periods, so the years 1987-1994 are the training period. Fit a logistic regression to the training period using the two predictors, and report the outbreak probability as well as a forecast for year 1995 (use a threshold of 0.5).
# First 8 data points to be used for training period
trainMildew <- Mildew[1:8, ]
# Look at it to make sure it did what we wanted
trainMildew
## Year Outbreak MaxTemp RelHumidity Outbreak.Yes
## 1 1987 Yes 30.14 82.86 1
## 2 1988 No 30.66 79.57 0
## 3 1989 No 26.31 89.14 0
## 4 1990 Yes 28.43 91.00 1
## 5 1991 No 29.57 80.57 0
## 6 1992 Yes 31.25 67.82 1
## 7 1993 No 30.35 61.76 0
## 8 1994 Yes 30.71 81.14 1
# Now fit a logistic regression model to the training period
LogReg.Mildew <- glm(Outbreak.Yes ~ MaxTemp + RelHumidity, data=trainMildew, family="binomial")
# Look at the summary
summary(LogReg.Mildew)
##
## Call:
## glm(formula = Outbreak.Yes ~ MaxTemp + RelHumidity, family = "binomial",
## data = trainMildew)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## 0.7466 -1.7276 -0.3132 1.0552 -1.1419 1.2419 -0.3908 0.6060
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543 44.4573 -1.263 0.207
## MaxTemp 1.3849 1.1406 1.214 0.225
## RelHumidity 0.1877 0.1578 1.189 0.234
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11.0904 on 7 degrees of freedom
## Residual deviance: 8.1198 on 5 degrees of freedom
## AIC: 14.12
##
## Number of Fisher Scoring iterations: 5
#Predict
Predict.Mildew <- predict(LogReg.Mildew, Mildew[9:12,], type="response")
# See what we predicted
# Remember this is the probability of there being an Outbreak
Predict.Mildew
## 9 10 11 12
## 0.1119407 0.7021411 0.5705413 0.3894790
#Predict just 1995
Predict.Mildew[1]
## 9
## 0.1119407
# Generate the confusion matrix
confusionMatrix(ifelse(Predict.Mildew > 0.5, 1, 0), Mildew[9:12,]$Outbreak.Yes, positive=c("1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 0
## 1 1 1
##
## Accuracy : 0.75
## 95% CI : (0.1941, 0.9937)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.7383
##
## Kappa : 0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 1.0000
## Specificity : 0.6667
## Pos Pred Value : 0.5000
## Neg Pred Value : 1.0000
## Prevalence : 0.2500
## Detection Rate : 0.2500
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8333
##
## 'Positive' Class : 1
##
1995 has a Breakout probability of 11.2%, leading us to believe that a breakout is unlikely. In addition, this forecast is 75% accurate, a significant improvement over the naive forecast, where accuracy was only 25%.